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Memories are stored, retained, and recollected through complex, coupled processes 
operating on multiple timescales. To understand the computational principles behind 
these intricate networks of interactions we construct a hroad class of synaptic mod¬ 
els that efficiently harnesses biological complexity to preserve numerous memories. 
The memory capacity scales almost linearly with the number of synapses, which is a 
substantial improvement over the square root scaling of previous models. This was 
achieved by combining multiple dynamical processes that initially store memories in 
fast variables and then progressively transfer them to slower variables. Importantly, 
the interactions between fast and slow variables are bidirectional. The proposed mod¬ 
els are robust to parameter perturbations and can explain several properties of biolog¬ 
ical memory, including delayed expression of synaptic modifications, metaplasticity, 
and spacing effects. 


Introduction 


The complexity and diversity of the numerous biological mechanisms that underlie memory is both fas¬ 
cinating and disconcerting. The molecular machinery that is responsible for memory consolidation at 
the level of synaptic connections is believed to employ a complex network of highly diverse biochemical 
processes that operate on different timescales (see e.g. Kandel et al.[|2013] Bhalla[|2^014 l. Understanding 
how these processes are orchestrated to preserve memories over a lifetime requires guiding principles to 
interpret the complex organization of the observed synaptic molecular interactions and explain its com¬ 
putational advantage. Here we present a class of synaptic models that can efficiently harness biological 
complexity to store and preserve a huge number of memories on long timescales, vastly outperforming 
all previous synaptic models of memory. 


The models that we construct solve a long-standing dilemma: in a memory system that is continually 
receiving and storing new information, synaptic strengths representing old memories must be protected 
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from being overwritten during the storage of new information. Failure to provide such protection results 
in memory lifetimes that are catastrophically low (Amit and Fusi 1994[ Fusi[ [2002 Fusi and Abbott 


20071. On the other hand, protecting old memories too rigidly causes memory traces of new information 
to be extremely weak, being represented by a small number of synapses. This is one aspect of the 
plasticity-rigidity dilemma (see also Me Closkey and Cohen 1989| Carpenter and Grossberg[ 1991 


[McClelland et al.[ 1995 Fusi et al. 20051. Synapses that are highly plastic are good at storing new 
memories but poor at retaining old ones. Less plastic synapses are good at preserving memories, but 
poor at storing new ones. 

Previous theoretical works have estimated the consequences of the plasticity-rigidity dilemma on the 
memory performance for various synaptic models characterized by different degrees of complexity. For 
many years, long-term potentiation of synapses was represented, at least by the modeling community, 
as a simple switch-like change in synaptic state. Memory models studied in the 1980’s (see Hopfield[ 


19821 suggested that networks of neurons connected by such synapses could preserve a number of mem¬ 


ories that scales linearly with the size of the network. However, subsequent theoretical analyses (Amit 
and Fusij 1994t Fusi[ 2002| Fusi and Abbottj [2007 1 revealed that what had appeared to be a harmless 


assumption in the theoretical calculations was actually a fatal flaw. The unfortunate approximation was 
ignoring the limits on synaptic strengths imposed on any real physical or biological system. When these 
limits are included, e.g. in the extreme case of binary synapses in which the weight takes only two dis¬ 
tinct values, the memory capacity grows only logarithmically with the number of synapses N for highly 
plastic synapses, and like ^/N for more rigid synapses that are able to store only a small amount of 
information per memory. 


A possible resolution of this dilemma is to make each synapse complex enough to contain both highly 
plastic and rigid components. In many models the plastic components are represented by fast biochem¬ 
ical processes, which can change rapidly to acquire and store a large amount of information about new 
memories. This initial memory trace is strong but labile; it decays quickly when other memories are 
stored. Memories can be consolidated if the information about each new memory is progressively trans¬ 
ferred to the slow components, which can preserve memories on longer timescales. This mechanism is 
widely used in artificial devices (e.g. computer memories, which include fast RAM and hard drives), it 
was proposed to explain memory consolidation at the systems level (McClelland et al. 1995[ Roxin and 


Fusi||2013 1, and it was incorporated into a model of synaptic memory based on a cascade of biochemical 


processes that operate on different timescales ( [Fusi et akj 20051. This form of synaptic complexity can 
greatly extend memory lifetimes without sacrificing the initial memory strength, accounting for our re¬ 
markable ability to remember for long times a large number of details even when memories are learned 
in one shot ( Brady et al.[ 20081. The two quantities that characterize memory performance, memory 
lifetime and the strength of t he initial memory trace, scale like the square root of the number of synapses 
(\/]V) in the cascade model (Fusi et al. 20051. 


Here we show that these models can be significantly improved when the network of interactions between 
the multiple biochemical processes that control the synaptic dynamics is bidirectional and appropriately 
tuned. In this case, the decay of the memory trace is substantially slower than in all previous models, 
leading to a memory lifetime that scales almost linearly with the number of synapses. Importantly, in 
our model, improved memory lifetime does not require a systematic reduction in the initial memory 
strength, which also scales approximately like the square root of the number of synapses. Although the 
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proposed synaptic model requires tuning, it is robust to noise and variations in its parameters. Finally, we 
construct a broad class of synaptic models that are equivalent in terms of memory performance. These 
different models capture the complexity and diversity of biochemical processes believed to be involved in 
memory consolidation. Thanks to their complexity, they can also reproduce the rich phenomenology of a 
plethora of biology and psychology experiments, including power-law memory decay ([Wixted and Ebbe 


sen 


1991[[T997 I, synaptic metaplasticity ( Abraharn| 2008 1 , delayed expression of synaptic potentiation 


and depression, and spacing effects (see e.g. Anderson 19951. 


The memory benchmark 


To study the process of storing multiple memories and to benchmark memory models we need to make 
assumptions about the nature of memories. Storage of new memories is likely to exploit similarities with 
previously stored information (see e.g. semantic memories). In what follows, we focus on mechanisms 
responsible for storing new information that has already been preprocessed in this way and is thus in¬ 
compressible. For this reason, we consider memories that are unstructured (random) and do not have 
any correlations with previously stored information (uncorrelated). Although this may appear to be a 
strong and limiting assumption, it is widely considered as the standard benchmark for synaptic mod¬ 
els, mainly because theoretical studies on random and uncorrelated memories are often predictive of the 
scaling properties of the memory performance in more general cases (see e.g. the case of the perceptron, 
[RosenblaTt 1958 1. 


Consider an ensemble of N synapses which is exposed to a continuous stream of modifications, each 
leading to the storage of a new memory. We express the assumption that the stored memories are un¬ 
structured by hypothesizing that the synaptic modifications are random and uncorrelated. Each synapse 
thus experiences a random sequence of potentiations and depressions, and the sequences are different 
and uncorrelated for different synapses. The memory of an event is defined by fhe paffern of N synapfic 
modificafions pofenfially induced by if. We will selecf arbifrarily one of fhese memories and frack if over 
fime. The selecfed memory is nol differenf or special in any way, so fhaf fhe resulfs for fhis particular 
memory apply equally fo all fhe memories being stored. 


To frack fhe selecfed memory we fake fhe poinf of view of an ideal observer fhaf knows fhe sfrengfhs 
of all fhe synapses relevanf fo a particular memory frace (see e.g. Fusi 2002 Fusi ef akj 20051. Of 
course in fhe brain fhe readouf is implemenfed by complex neural circuifry, and estimates of fhe sfrengfh 
of fhe memory frace based on fhe ideal observer approach may be significanfly larger fhan fhe memory 
frace fhaf is acfually usable by fhe neural circuifs. However, given fhe remarkable memory capacify of 
biological sysfems, if is nof unreasonable to assume fhaf fhe readouf circuifs perform almosf optimally. 
Moreover, we will show fhaf fhe ideal observer approach predicfs fhe correcf scaling properfies of fhe 
memory capacify of simple neural circuifs fhaf acfually perform memory refrieval (see fhe Discussion 
and Suppl. Info. S.I2i. 


More quanfifafively, we define fhe memory signal as fhe correlation befween fhe sfafe of fhe synapfic 
ensemble and fhe paffern of synapfic modificafions originally imposed by fhe evenf being remembered. 
Previously sfored memories, which are assumed fo be random and uncorrelafed, make fhe memory frace 
noisy. Memories fhaf are sfored affer fhe fracked one continuously degrade fhe memory signal and also 
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contribute to its fluctuations. We will monitor the signal to noise ratio (SNR) of a memory, which is 
defined as the ratio between the memory signal and its standard deviation (see Methods M. 1 for a more 
formal definition). One measure of memory performance is the memory lifetime, the maximal time since 
storage over which a memory can be detected, i.e. for which the SNR is larger than one. The scaling 
properties of the memory performance that we will derive do not depend on the specific choice of the 
critical SNR value, as long as it is of order one. The memory lifetime is also a measure of the memory 
capacity because all memories that have been stored more recently than the tracked one will have a 
larger SNR, and hence if the tracked memory is retrievable, then all the more recent memories will be 
retrievable a fortiori. 


Constructing the synaptic model 


The value of a synaptic weight w at any given time is typically the result of multiple synaptic modifica¬ 
tions. To build an efficient synaptic model, it is instructive to start from an abstract memory model in 
which the present weight is expressed as a sum of synaptic modifications Aw, weighted by a factor r 
that decreases with the age of the modification t — ti, where t is the current time and ti is the time of 
the Ith modification. In this case the signal of the corresponding memory would decay as r{t — ti). The 
noise would be approximately proportional to square root of the the variance of w at time t, 

Var('u;(f)) = ^ [Aw{ti) r{t - ti)f , (1) 


where we have assumed that the average of Aw{ti) is zero, which is equivalent to hypothesizing that 
synaptic potentiation and depression are balanced. A slowly decaying r would enable the synaptic weight 
to depend on a large number of modifications, but it would also induce a large variance for w{t), poten¬ 
tially arbitrarily large if the sum extends over an arbitrary number of modifications. On the other hand, 
fast decays would limit the memory capacity. From eqn. Q it is apparent that in the case of random 
and uncorrelated modifications, the slowest pow er-law decay one can affor d while keeping w finite is 
approximately r{t) ~ (see also Methods M.2i. In Suppl. Info. S.I we show that under some 

conditions this is approximately the optimal solution among all possible functional decays (see also the 
Discussion). 


This abstract model reveals what kind of decay of the memory signal is desirable, but it does not explain 
how this behavior is achievable by synaptic dynamics. The next step is to construct a model that im¬ 
plements the desired power-law decay. One simple way would be to endow each synapse with a timer 
and introduce a mechanism to decrease the relative weight of each synaptic modification on the basis 
of the age of the modification (see e.g. Wu and MeH 20091, but this would just move the problem to 
the encoding and preservation of the age of a memory, which is potentially as difficult as the original 
memory problem we intend to solve. As we will show, there is no need for a timer, as there are synaptic 
models in which the l/\/t decay emerges naturally from the interaction of multiple processes. 


We will start with the construction of a simple chain model that captures and illustrates all the relevant 
scaling properties of more complex models. Then we will show how to generalize the model to incorpo¬ 
rate less orderly interactions that are more similar to those observed in biological synapses. The simple 
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chain model is described in Fig. and is characterized by multiple dynamical variables, each repre¬ 
senting a different biochemical process. The first variable, which is the most plastic one, represents the 
strength of the synaptic weight. It is rapidly modified every time the conditions for synaptic potentiation 
or depression are met. For example, in the case of STDP, the synapse is potentiated when there is a pre- 
synaptic spike that precedes a post-synaptic action potential. The other dynamical variables are hidden 
(i.e. not directly coupled to neural activity) and represent other biochemical processes that are affected by 
changes in the first variable. In the simplest configuration, these variables are arranged in a linear chain, 
and each variable interacts with its two nearest neighbors. These hidden variables tend to equilibrate 
around the weighted average of the neighboring variables. When the first variable is modified, the sec¬ 
ond variable tends to follow it. In this way a potentiation/depression is propagated downstream, through 
the chain of all variables. Importantly, the downstream variables also affect the upstream variables as the 
interactions are bidirectional. 


To gain insight into the way this type of synapse works, it is useful to resort to an analogy with a set of 
communicating vessels, a more intuitive physical system. This analogy is illustrated in Fig. [IP- Each 
synaptic variable is represented by the level of liquid in a beaker. The interactions between variables 
are mediated by tubes that connect the beakers. The first beaker (yellow) represents the synaptic weight. 
The synapse is potentiated by pouring liquid into it, whereas depression is implemented by removing 
liquid. As the liquid level deviates from equilibrium, the fluid flow through the tubes will tend to balance 
the level in all beakers. The balancing dynamics is fast when the beakers are small and the tubes large, 
but slow for large beakers and small tubes. A single synaptic modification is remembered as long as the 
liquid levels remain significantly different from equilibrium. 

We now show how to construct the desired synaptic memory model by considering the analogous sys¬ 
tem of communicating vessels. An efficient memory system should have both long memory lifetimes 
(i.e. long relaxation times) and a large initial memory strength, obtained with a relatively small number 
of variables (i.e. number of beakers). It is possible to build a system in which the memory strength 
decays like a power law (approximately 1 / \/t) and that only requires a number of variables that grows 
logarithmically with the memory lifetime. 


We will construct this system in three steps, progressively increasing the number of tuned parameters 
to improve memory performance. First, consider a series of identical beakers that are arranged in a 
linear chain and connected by a set of tubes with equal cross sections (see Fig. [Tp). When the first 
variable ui is perturbed, for example by adding liquid to the first beaker, liquid starts flowing to the other 
beakers, relaxing towards equilibrium. This relaxation dynamics is illustrated in Fig. for a system 
with 31 beakers. In the first plot, the liquid levels of all beakers are shown at three different times. The 
perturbation starts from the first beaker and then slowly spreads to all the other beakers. This process, 
which is analogous to heat diffusion (see Methods M.4| ), is characterized by a decay of the perturbation 
that follows a power law (l/\/f), at least for a time period that scales quadratically with the number of 
beakers, after which it becomes exponentially fast. This system has the desired decay properties, but it 
requires an unreasonably large number of beakers. A synapse based on this mechanism would require a 
number of biochemical processes (each process being equivalent to a beaker) that scales like the square 
root of the number of storable memories and can be as large as the square root of the total number of 
synapses. 


Interestingly, it is possible to have a comparable memory performance with a significantly smaller num- 
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ber of variables. We can combine together multiple beakers, as shown in Fig. and construct a system 
with a number of beakers that scales only logarithmically with the memory lifetime. The first beaker 
remains the same as in the original linear chain. The next two are merged into a larger beaker with twice 
the cross-sectional area, which contains the same volume of liquid as the two original ones. Then, the 
next four beakers are combined together into a larger one, and we repeat this merging procedure until 
we reach the end of the chain. At each step the number of original beakers that are combined dou¬ 
bles. This implies that the variables describing the system operate on different timescales that increase 
exponentially as one moves along the chain. 


While this merging procedure dramatically reduces the number of beakers, the convergence to equilib¬ 
rium is now significantly faster than before. In the original system, equilibrating two distant beakers takes 
a time that scales quadratically with the number of intermediate beakers. If these intermediate beakers 
are merged into one, the required time is drastically reduced, which leads to a much faster memory decay 
(~ 1 /t) than in the previous case, as illustrated in Fig. HP- 


Fortunately it is possible to recover the slow decay, without increasing the number of beakers, by tuning 
the cross sections of the tubes, as shown in Fig. [IP- When the identical tubes are replaced with progres¬ 
sively smaller ones (by powers of two), the decay slows down and follows l/y/i over a time period that 
grows exponentially with the number of beakers. This means that it is possible to construct an efficient 
synapse whose memory decays in the optimal way and that requires a number of biochemical processes 
that grows only logarithmically with the longest memory lifetime (see also Section M.3| ). We now show 
that these features are preserved when we consider a population of synapses storing multiple memories, 
even if the synaptic dynamical variables can vary only in a limited range and their values can only be 
preserved with limited precision. 
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Figure 1 (following page): A. Schematic of a simple synaptic plasticity model. The dynamical 
variables represent different biochemical processes that are responsible for memory consolidation 
(k = 1, where m is the total number of processes). They are arranged in a linear chain and in¬ 

teract only with their two nearest neighbors (see differential equation), except for the first and the last 
variable. The first one interacts only with the second one (and is also coupled to the input), while the 
last one interacts only with the penultimate one. Moreover, the last variable Um has a leakage term that 
is proportional to its value (obtained by setting Um+i = 0). The parameters gk,k+i are the strengths 
of the bidirectional interactions (double arrows). Together with the parameters Ck they determine the 
timescales on which each process operates. The first variable ui represents the strength of the synaptic 
weight. B. The schematic model of A behaves like a set of communicating vessels. The Uk variables 
measure the deviation of the liquid level from equilibrium, shown in the third beaker as a blue dashed 
line. The Ck represent the sizes (areas) of the beakers, and the coupling constants gk,k+i correspond to 
the cross-sections of the connecting tubes. Again, the liquid level in the first beaker (yellow) represents 
the synaptic strength. The last beaker is connected to a reservoir whose liquid level is always at equilib¬ 
rium. This interaction represents the leak in the differential equation of Um- C. Relaxation dynamics in 
a set of 31 identical beakers connected by tubes of equal size (Ck = l,gk,k+i = 1/8). A perturbation 
of the liquid level of the first beaker propagates to the others, slowly disappearing. The 31 variables 
are shown in the middle at three different times and the decay of ui, which approximates a power law 
(l/Vi), is plotted on the right on a log-log scale. D. A new set of beakers is obtained by merging those 
of panel C. The number of merged beakers progressively increases, leading to successively larger ones 
(Ck = 2^“^). The cross-sections of the tubes are still all identical (as indicated by the blue ovals). The 
number of variables is now significantly smaller, but the decay is too fast (l/t). E. Completely tuned 
set of communicating vessels: the sizes of the tubes connecting the beakers are progressively reduced to 
slow down the decay (gk,k+i = 2“^“^), which now follows the desired l/\/f behavior as in C, but with 
a number of beakers that scales as the logarithm of the original number. 
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Discretization of the dynamical variables and scaling properties 


It is clearly unrealistic to assume that each dynamical variable can vary over an unlimited range and 
be manipulated with arbitrary precision when Uk represents a physical quantity, such as the number of 
molecules in a particular state, which is typically relatively small given the size of a synapse (e.g. there 
are only tens of CaMKII molecules at each synapse). Therefore, we discretize the and impose rigid 
limits on them. The dynamics of the model is now described by stochastic transitions between a discrete 
set of levels for each variable, arranged to approximate the continuous system constructed above. At 
every time step, the Uk are first updated as in the case of continuous variables described above, but then 
each variable is discretized by setting it stochastically to one of the two values in the discrete set that are 
closest to the updated value. The probabilities of ending up in each of the two values are chosen so that 
the average of the discretized matches the continuous Uk (see Section M.5 for details). 


Assuming that memories are presented at a constant rate of one new uncorrelated memory per unit of 
time, we show in Fig. the SNR as a function of time for memory models in which the complexity of 
the synapse progressively increases - the number m of variables varies between 4 and 10. The curves are 
plotted on a log-log scale, so a straight (downwards) line corresponds to a power-law decay. In all cases, 
the SNR decays approximately as 1/ \/t, as expected, over a time interval that increases exponentially 
with the complexity of the synapse. Then the decay accelerates and becomes exponential. Therefore, the 
corresponding memory lifetime increases exponentially with m (see Fig.[^) up to a limit of order N, the 
total number of synapses. Conversely, increasing the number of synapses while keeping m fixed, we find 
a memory lifefime fhaf grows linearly wifh N (see Fig. 13^)^ until if safurafes af fhe longesf timescale of 
fhe memory sysfem (which is exponenfial in m). This safurafion can be avoided by adjusfing fhe longesf 
fimescale appropriafely, which leads fo a memory lifefime scaling as N/ log(A^). 


The memory l ifefime in previou s models of complex synapses wifh bounded weighfs scales af mosf as 
\/fV (see e.g. Fusi ef al. 20051. A memory lifefime fhaf scales (almosf) linearly wifh fhe number of 


synapses consfifufes a major improvemenf, especially in large neural sysfems. For fhe human brain, wifh 
N ~ 10^^, fhe memory capacify would pofenfially be exfended by a factor of almosf 10^. Imporfanfly, 
fhis is achieved wifh a relafively small increase in fhe complexify of fhe synapfic machinery for memory 
consolidafion, as m grows only logarifhmically wifh fhe memory lifefime. Moreover, fhe initial SNR, 
which is relafed fo fhe amounf of information stored per memory, has fhe same scaling wifh N as in 
previous models (i.e. '/N, see Fig.|^), and only decreases slowly wifh m (as Ijy/m, see Fig.|^). 


Robustness of the model 


We now sfudy sysfemafically fhe effecls of discrefizafion on memory performance. In Fig. we plof 
fhe disfribufions of fhe Uk variables across fhe model synapse when each Uk varies over a discrefe sef of 
35 equally spaced values. The maximal and minimal values are rigid boundaries. All disfribufions are 
approximafely Gaussian, wifh a widfh fhaf is largesf for fhe firsf variable ui and progressively decreases 
for fhe ofher Uk as k increases (see Suppl. Info. S.2i. Since almost all values are well within the bound¬ 
aries, the relaxation dynamics of the variables is very similar to the unbounded case and the SNR 
curve changes smoothly when we restrict the dynamical range even further (Fig.[^). Indeed, the width 
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of the broadest distribution (that of ui) scales only like ^/]ogT, where T is the longest timescale of the 
synapse (approximately T ~ Cralgm,m+i = 22™+i). 


Because the distributions are narrower for the slower dynamical variables, one may wonder whether the 
range and the number of levels could be progressively decreased as a function of k without affecting the 
memory performance significantly. This is indeed the case, as demonstrated in Figs. [^,D. When the 
number of equally spaced levels decreases linearly with k, the distributions are very similar to the case 
in which the number of levels remains the same for all variables (Fig. [^). The memory performance 
is almost identical in the two cases (Fig. [^). This implies that the slower dynamical variables do not 
require as much precision as the fastest ones. Slower variables only need a number of levels that can 
be surprisingly small, just two for the slowest one in the example of Figs. |^,D. The slowest variables 
need to preserve their values over timescales of years, and this would likely be difficult to implement if a 
large number of values had to be distinguished. In contrast, bistable processes can easily be stable over 
very long time periods (Crick 1984| [Miller et al.l 2005 Si et al. 20031. For a small number of levels 
that is larger than two, one could combine multiple bistable processes or use slightly more complicated 
mechanisms (jShouval 20051. 


In Suppl. Info. S.3 we show that the model is robust not only to discretization, but also to parameter 
variations, and can tolerate surprisingly large perturbations of the optimal beaker and tube sizes. The 
SNR of the perturbed model deviates from the SNR of the unperturbed model, but the deviation increases 
smoothly with the amplitude of the perturbations. Moroever, the SNR still decays approximately as 


Figure 2 (following page): Scaling properties of the synaptic model. A. Memory signal to noise ratio 
(SNR) as a function of the number of random uncorrelated memories that are stored after the tracked 
memory. The SNR is computed for a population of = 5.4 x 10® synapses. The scales of both axes 
are logarithmic. Different curves correspond to synaptic models with a different number of dynamical 
variables (m = 4,6,8,10). m is also the number of beakers in Figure [T] Each variable can vary on 
a discrete set of 40 equally spaced values. For all curves, the decay follows approximately a power 
law (l/y/i) for a large number of memories and then becomes exponential where the curves visibly bend 
downwards. The range of the power-law decay increases exponentially with m, which is a measure of the 
complexity of the synapse. Memories are assumed to be stored at a constant rate of one new uncorrelated 
memory per unit time, which we choose to be one minute here, so that the SNR decay can also be 
expressed as a function of time (upper horizontal axis). This choice of overall timescale is completely 
arbitrary and time is considered only to help the reader appreciate the wide span of memory lifetimes. 
The memory lifetime is defined as the time elapsed since storage (or number of subsequently stored 
memories) at which the SNR falls below some arbitrary threshold (dashed line). B. Memory lifetime vs 
m. The vertical axis is logarithmic, the horizonal one is linear, so the line that fits the simulation points 
represents an exponential growth. C. Initial SNR, denoted by SNRq, vs m. Both axes are linear. As m 
increases, the initial SNR slowly degrades (~ l/yTn). N = 5.4 x 10® both in B and C. D. Memory 
lifetime vs N, the number of synapses on a log-log scale. The memory lifetime, which is proportional to 
the memory capacity (i.e. the total number of memories that can be stored), increases linearly with N, as 
expected from the 1/Vi decay of the SNR. This is a major improvement over previous synaptic models. 
E. Initial SNR vs on a log-log scale. SNRq grows like VN, as in the best previous synaptic models, 
m = 12 for both D and E. 
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Figure 3: Robustness: Effects of different discretization schemes of the dynamical variables. A. Distri¬ 
butions of the Uk variables for /c = 1, 2,..., m in a population of synapses at equilibrium. The synaptic 
model has m = 12 dynamical variables with 35 discrete values each. The fastest variable is on the left. 
The distributions are approximately Gaussian and become progressively narrower for slower variables. 
B. SNR vs number of stored memories, as in Fig. [^, for discretizations with different numbers of lev¬ 
els (namely 20, 30,40 and 50). C. Distributions of the Uk variables when the number of discrete levels 
decreases progressively for the slower variables. The last variable has just two stable levels. The distri¬ 
butions are very similar to the case in which the boundaries and the number of levels are the same for all 
dynamical variables (panel A). D. SNR vs number of stored memories for constant (black) and decreas¬ 
ing number of discrete levels (green). The performance remains almost unaffected by the reduction in 
the number of levels. 
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1 / y/i. These results indicate that the model parameters do not need to be finely tuned. 


Generalizations of the model 


In the previous sections we considered synaptic models that can be represented by linear chains of dy¬ 
namical variables. We focused on these models because their simplicity allowed us to illustrate the 
computational principles we used to design them. However, they appear too simple and orderly to ac¬ 
commodate the complexity and diversity of biological synapses. Here we show that it is possible to 
construct a broad class of synaptic models that are equivalent to linear chains in terms of memory perfor¬ 
mance. Such complex models can readily be constructed by starting from the undiscretized linear chain 
model depicted in Fig. [T] and then iteratively ramifying it. For example, the second beaker could be con¬ 
nected to two identical beakers instead of one, splitting the chain into two. Each of the two beakers would 
then be connected to a series of progressively larger ones. Pairs of corresponding beakers would have the 
same total capacity as the associated single beaker of the original chain. This ramification process can be 
iterated an arbitrary number of times, with any choice of relative importance weights assigned to differ¬ 
ent branches. Furthermore, such branches can merge again, leading to complex networks of interactions 
like the one shown in Fig. |^, which are still equivalent to the original linear chain. 


In Section M.6 we show that if the cross-sections of the tubes are properly tuned these complex models 
have the same dynamics for the first beaker and therefore the same memory performance as the linear 
chain models. We also demonstrate that they are robust to relatively large perturbations, such as the com¬ 
plete loss of one interaction pathway, which can be partially compensated by parallel branches. More¬ 
over, such complex synaptic models can readily reproduce various experimental observations, which 
include delayed LTP/LTD expression (see e.g. O’Connor et al. 20051, one form of metaplasticity (e.g. 
Abraham 2008 [ O’Connor et al. 20051 and spacing effects (e.g. Andersonj 19951. 


Until now we have considered models in which the synaptic efficacy is insfanfaneously modified by 
adding or removing liquid from fhe firsl beaker. The long-ferm memory performance, however, remains 
basically unalfered when liquid is added or removed from ofher beakers insfead, even fhough fhe ex¬ 
pression of a synapfic modification is delayed by fhe lime if lakes fhe liquid fo flow info fhe beaker 
represenfing fhe efficacy. This suggesls lhal LTP and LTD inducfion profocols may affecl disfincl bio¬ 
chemical processes lhal correspond lo differenl beakers in fhe model and do nol need lo operate direclly 
on fhe same variable. Analogously, fhe synapfic efficacy does nol need fo be read oul from fhe firsl 
beaker. If could be delermined by anofher beaker or even be some funclion of fhe liquid levels of mulli- 
ple beakers (see also Suppl. Info. S.4i. If fhe inpuf beaker is nol read oul, very recenl memories mighl 
nol be immedialely available for relrieval as fhe liquid has fo propagale lo fhe readoul beakers firsl. 


Anofher nalural consequence of fhe archileclure of Ihese synapfic models is melaplaslicily, fhe depen¬ 
dence of plaslicily on fhe hislory of previous synapfic modifications. Here, melaplaslicily is an obvious 
consequence of fhe existence of hidden variables, represenled by fhe beakers lhal are nol direclly read 
oul lo determine Ihe synaptic efficacy. For example, synapses lhal undergo a long series of polenlialing 
evenls become more resislanl lo depression ( [O’Connor et al.[ 2005 1 |. A long series of LTP induction pro¬ 
tocols can significantly increase the liquid levels in several beakers, making it more difficult to stabilize 
a subsequent synaptic depression. This is illustrated in Pig.[^, in which we plotted the synaptic efficacy 
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as a function of the time elapsed since an LTD induction protocol in two cases: in the first one LTD is 
preceded by a short series of LTP events, and the depression is relatively stable. In the second case LTD 
is preceded by a long series of LTPs and the synapse is only transiently modified even fhough there are 
now two LTD events. The different degrees of plasticity are determined by different initial conditions of 
the hidden variables (despite equal initial efficacies), which were sef by fhe previous hisfory of synapfic 
modificafions. 


In Fig.|^ we show thaf fhe model can also replicafe fhe empirical phenomena known as spacing effecfs. 
The sfabilify of memories fhaf are stored repeafedly is known fo depend on fhe spacing befween fhe limes 


of memorizalion. This phenomenon has been observed in several behavioral sludies (see e.g. Anderson 


19951 and more recenfly in eleclrophysiology experimenfs on synapfic plasticity (see e.g. Carew el al. 


|1972t Zhou ef ahj 20031. In Ihese experimenfs, when fhe inlerval befween repefilions is too shorl or too 
long, fhe memories are less slable lhan in fhe case in which the repetitions are properly spaced. Our 
explanation for these observations is surprisingly simple. Consider the analogy with communicating 
vessels when a synapse is repeatedly potentiated. In the case of long lags, the liquid added during 
potentiation has time to almost settle to equilibrium between repetitions, leading to little accumulation 
of synaptic modifications. In the case of massed repetitions on the other hand, one of the dynamical 
variables may hit its upper bound, which would correspond to liquid spillover in our analogy. The 
overall effect of potentiation would also be reduced by this loss of liquid. 


Figure 4 (following page): A broad class of complex synaptic models with equivalent memory perfor¬ 
mance. A. A generalization of the model shown in Fig. [T] in which each dynamical variable is coupled 
to two or more other variables. For simplicity we consider continuous dynamical variables (i.e. not dis¬ 
cretized). The model is constructed iteratively starting from the linear chain of beakers of Fig. [T] For 
example, the second beaker is now connected to two beakers on the right. Analogous splittings and 
mergings lead to the set of beakers of the figures. When fhe cross-secfions of fhe fubes are properly 
funed, fhe memory performance of the model is the same as for the original linear chain of beakers. B. 
Metaplasticity: the dynamics of the synaptic efficacy depends on the history of synaptic modifications. 
Red: the synapse is depressed at time 0 and the depression is preceded by a short series of 5 LTP events. 
In this case the last LTD event is still effective and long-lasting. The unit of time is one second. Blue: de¬ 
pression at time 0 is preceded by a long series of 50 LTP events and another LTD event. The depression 
of the synapse is only transient revealing that the synapse is more resilient to long-term changes than 
in the case in which depression is preceded by a short series of LTP events. The number of LTP/LTD 
events has been chosen so that the initial efficacy is approximafely fhe same in fhe fwo cases. C. Spacing 
effecf: synapfic efficacy 100 fime sfeps afler fhe end of a sequence of fhree LTP evenfs which have been 
spaced differentiy for differenf poinfs. Massed repefilions of LTP evenfs (shorf inlervals) and dislribufed 
repefilions (long inlervals) are less effective fhan property spaced ones. The opfimal inlerval is around 
40 time steps. The unit of time is again one second, to match the timescales of the experiments described 
ini! 


Zhou et al. (2003 1 . 
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Testable predictions 


One of the testable quantitative predictions of the theory concerns the rate of decay of memory traces. 
A power-law decay of the memory SNR approximating Xjyft is a signature feature of the models that 
we discussed. Although it is known that memory decay can be described by power laws in psychology 
experiments (see e.g. Wixted and Ebbesen 1991[ 19971, the power varies significantly from experiment 
to experiment, and it is difficult to draw conclusions. This variability is probably due to the fact that in 
most of the experiments the memories are not random and uncorrelated, as subjects often experience the 
same or similar episodes multiple times and can even internally rehearse previously stored memories. 
Consequently, the memory decay depends on the specific sfafisfics of fhe memories, fheir relafive impor- 
fance, and fhe rafe af which fhey are rehearsed or re-experienced. Complex sysfem level processes fhaf 
deal wifh such sfrucfured memories were nof incorporated in our model and in any case could be difficull 
fo confrol in experimenfs. 


A feasible experimenf fo fesf our fheory would be fo repeatedly modify a single synapse (or a populafion 
of synapses) and observe how fhe aufocorrelafion of fhe synaptic efficacy decays wifh time. A balanced 
random series of LTP and LTD profocols can induce mulfiple changes in fhe synapfic efficacy. We 
expecf fhaf fhe observed aufocorrelafion would be very broad and ifs decay only logarifhmic on long 
fimescales (shorfer fhan fhe memory lifefime; see Mefhods S.5 and Suppl. Info. S.6 for defails). Such 
a logarifhmic decay is a disfincfive feafure of models wifh a signal fo noise rafio fhaf approximafes 
1 / \/f. As shown in Fig. [^, fhe aufocorrelafion is approximafely a sfraighf line when plotted againsf fhe 
logarifhm of fhe time lag. Models wifh faster memory decay (1/f^/^ and \jt are shown in fhe figure) 
are characterized by aufocorrelafion functions fhaf decay significanfly faster, wifh a prominenf positive 
curvafure. Inferesfingly, fhe slope of fhe line depends on fhe longesf timescale of fhe memory sysfem 
under considerafion. As fhis fimescale increases, fhe slope decreases, and fhe line becomes progressively 
more horizonfal (see Fig.[^). 


While fhere are several fechnical issues complicafing such an experimenf, we believe fhaf none of fhem 
are insurmounfable: Firsf, fhe duration of fhe experimenf should be long enough fo cover af leasf fhree 
orders of magnifude (e.g. wifh 1000 brief inducfion profocols). Second, LTP and LTD should be suifably 
balanced. Since one of fhe fwo profocols mighf be more effecfive fhan fhe ofher, some calibration would 
be required fo avoid imbalance. Unforfunafely, fhe calibrafion procedure may require a lime fhaf is as 
long as fhe duration of fhe experimenf fhroughouf which fhe aufocorrelafion is measured, as balance 
should be achieved on all timescales fhaf are considered. 


Discussion 


We have presenfed a broad class of synapfic models fhaf exhibif a huge memory capacity. These models 
show fhaf complexity, which is widely observed in all fypes of biological synapses, is imporfanf fo 
achieve long memory lifelimes and sfrong inifial memory fraces. Complexify was already shown fo 
be beneficial in previous models, bofh for synaptic (Fusi el al. 2005 1 and for systems level memory 
consolidation ( |Roxin and Fusi[ 20131. In bofh cases fhe memory fraces were initially stored in fasl 
variables and fhen progressively Iransferred fo slower variables. Mulfiple timescales and memory Iransfer 


16 

















Minute Hour 




0.8 


0.6 


0.4 


0.2 


Minute 


Hour 


10 ' 10 ^ 10 ® 
Lag+1 (num. of induction protocols) 


Figure 5: Testing the model in experiments. A. Autocorrelation of synaptic efficacy in a simulated 
experiment in which a synapse undergoes a long random series of 10000 LTP and LTD protocols. Here 
we assumed a rate of 10 protocols per minute. On the lower horizontal axis we represent the lag expressed 
in terms of the number of protocols and on the upper axis we represent the time lag. Notice that in 
both cases the scale is logarithmic. The three curves represent the autocorrelation functions for three 
different models. Our proposed model (light red) has a distinctive decay, which appears almost as a 
straight line on a log-linear plot. Other models, with faster decays of the signal to noise ratio, exhibit 
autocorrelations with a significantly steeper falloff. The shaded areas represent the standard error for 20 
repetitions of the experiment. B. Dependence of the autocorrelation function on the longest timescale 
of the memory system under consideration. The different lines, again plotted on a log-linear scale, 
correspond to progressively increasing longest timescales. In the limit of very large timescales, this line 
would become horizontal. 


were the two key ingredients needed to achieve simultaneously slow decays of memory traces and strong 
initial signals. A 1/f decay, with t the age of the memory, led to initial memory traces and memory 
lifetimes whose magnitudes scale as Vn, where N is the total number of synapses. We showed here 
that it is possible to combine the same key ingredients to drastically extend the memory lifetime without 
sacrificing the initial strength of the memory traces and without substantially increasing the complexity 
of the synapse (e.g. the number of dynamical variables). Indeed, the model presented here exhibits a 
significantly slower decay, approximately 1 / \/t, which permits memory lifetimes that scale almost like 
N instead of ^/N, and initial SNRs that are basically the same as in the old models (see Suppl. Info. S.7 


for a direct comparison between models). When considering large systems like the human brain, this is 
potentially a huge improvement, that has been obtained by introducing bidirectional interactions between 
fast and slow variables. 


Note that in our model the interactions between fast and slow variables are significantly more impor¬ 
tant than in previous models. In Suppl. Info. |S.8| we show that it is possible to build a system with 
non-interacting variables that exhibits a l/Vf decay. However, this requires disproportionately large 
populations of slow variables, which greatly reduce the strength of the initial SNR. In the best case, the 
memory lifetime scales only like \/iV, and the initial SNR like Both quantities are substantially 

worse than in our model with interacting variables. This is not the case for previous models, for which 
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the advantage of interactions was significant, but the scaling properties of both the memory lifetime and 
the initial SNR were approximately the same in the interacting and non-interacting case. 


The proposed synapses are complex, as they require processes that operate on multiple timescales, but 
the number of these processes is relatively small and scales only logarithmically with the memory life¬ 
time. This is achieved by properly spacing the characteristic timescales. As the dynamical variables can 
all be varied independently, the space of all possible states of each synapse can be huge and grows expo¬ 


nentially with the number of variables (see Suppl. Info. S.9l. This is known to allow for slow memory 
decay (Lahiri and Ganguh] 20131. 


The improved performance obtained with slow decays requires some degree of tuning. As we showed, 
the model is robust to certain types of perturbations of the parameters, but significantly less so to others. 
Specifically, when described as a sef of communicafing vessels, fhe model can folerafe surprisingly large 
variations of fhe beaker and lube sizes. However, fhe memory performance decreases drasfically when 
polenlialions are nol balanced wifh depressions (see Suppl. Info. |S.3| ), or when fhe sef of communicafing 
vessels has leaks fhaf are larger fhan fhe one of fhe lasf beaker. The necessary forms of funing increase 
fhe memory performance by several orders of magnifude, and fherefore are probably encoded genefically 
and mainfained acfively by homeoslafic mechanisms. A failure of fhese mechanisms can lead fo dramafic 
consequences. The model sensifivify lo fhe pofenfialion/depression imbalance could be relafed fo fhe 
severe degradation of memory performance observed in fhe early sfages of Alzheimer’s disease, when 


depression becomes significanlly more effeclive fhan pofenfialion (see e.g. Shankar el al. 20081. 


Optimality of the model 


As previously noled, fhe approximate l/\/f decay of fhe memory Irace is fhe slowesl allowed among 
power-law decays. Slower decays lead lo synaplic efficacies fhaf accu mulafe changes loo rapidly and 
grow wifhoul bound. Interestingly, one can prove (see Suppl. Info. S.l I fhaf fhe l/\/f decay maximizes 
fhe area belween fhe log-log plol of fhe signal lo noise ratio and fhe Ihreshold (i.e. fhe area belween 
fhe SNR curve in Fig. and fhe Ihreshold). This slalemenl is Irue nol only when one reslricls fhe 
analysis lo power laws, buf also when all possible decay functions are considered. One mighl wonder 
whal would be fhe rationale behind maximizing fhe area under fhe log-log plol of fhe SNR. The inluifive 
reason can be summarized as follows: while we wanl lo have a sizable SNR lo be able lo relrieve a 


memory from a small cue (see e.g. Kraulh el al. 19881, we do nol wanl lo spend all our resources 


making an already large SNR even larger. Thus we discounl very large values by faking a logarilhm. 
Similarly, while we wanl lo achieve long memory lifetimes, we do nol focus exclusively on Ihis al fhe 
expense of severely diminishing Ihe SNR, and Iherefore we also discounl very long memory lifetimes by 
laking a logarilhm. While pulling less emphasis on exlremely large signal lo noise ratios and exlremely 
long memory lifetimes is very plausible, Ihe use of Ihe logarilhm as a discounting function is of course 
arbilrary. 

Il is interesting lo consider also Ihe case in which Ihe SNR is nol discounted logarilhmically, i.e. when 
one wanls lo maximize Ihe area under Ihe log-linear plol of Ihe SNR. In Ihis silualion, Ihe optimal decay 


is faster, namely 1/1, as in some of Ihe synaptic models previously considered (Roxin and Fusi 2013 
Fusi el aQpOOSI ). 
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Biological interpretations 


To understand how the proposed model can be implemented by biological processes, it is important to 
discuss the possible interpretations of its dynamical variables. One possibility is that the variables Uk 
represent the deviations from equilibrium of chemical concentrations (see Suppl. Info. S.IO for details). 
The timescales on which these variables change would then be determined by the equilibrium rates (and 
concentrations) of reversible chemical reactions. However, for the slowest variables, which vary on 
timescales of the order of years, it is probably necessary to consider biological implementations in which 
each Uk corresponds to multiple interacting processes. For example, we showed that the slowest variable 
can be discretized with only two levels, and hence it could be implemented by a bistable process, which 
would allow for very long timescales (Crick 1984 Miller et Hi] 2005 1 Si et al.[ 20031. These biochemical 
processes could be localized in individual synapses, and recent phenomenological models indicate that 
at least three such variables are needed to describe experimental findings ( Ziegler et al.[ 20151. However, 
these processes could also be distributed across different compartments of a neuron, across different 
neurons in the same local circuit or even across multiple brain areas. If two coupled Uk variables reside 
in different neurons, their interactions must be mediated by processes that likely involve neuronal activity, 
such as the widely observed replay activity, as proposed in Roxin and Fusi ( 2013| ). In the case of different 
brain areas, the synapses containing the fastest variables might be in the medial temporal lobe, e.g. in the 
hippocampus, and the synapses with the slowest variables could reside in the long-range connections in 
the cortex. In all these cases the parameters N and m of the model would have a different interpretation 
that depends on the specific archifecfure of fhe modeled neural circuifs, buf fhe scaling properfies of fhe 
sysfem would be as opfimal as in fhe case fhaf we discussed. 


Memory retrieval in simple neural circuits 


The ideal observer approach allowed us fo analyze fhe scaling properfies of memory sysfems wifh hardly 
any assumptions abouf fhe archifecfure of fhe neural circuif, fhe specific learning rule and fhe neural 
represenfafions. However, if is imporfanf fo fesf whefher fhese scaling properfies are preserved in specific 
simulafed neural circuifs. In Suppl. Info. S.12 we reporf fhe analysis of fwo simple cases of memory 
refrieval, which have been used as memory benchmarks in fhe pasf. The firsl one is a simple feedforward 
percepfron storing random patterns. The second one is a fully connecfed recurrenf neural nefwork similar 
fo fhe one proposed by Hopfield ( Hopfield] 19821, whose memory capacify is estimated bofh in full 
simulafions of fhe dynamical nefwork and fheorefically, as in Amif and Fusi (1994 1 , where fhe signal- 
noise rafio analysis is basically equivalenf fo fhe ideal observer approach used in our arficle. In bofh 
cases, fhe ideal observer approach predicfs fhaf fhe number of storable memories scales linearly wifh fhe 
number of neurons Nn- Note fhaf in fhe recurrenf nefwork fhe fofal number of synapses is of 0{N^). 
However, nof all of fhese synapses receive independenf inpufs, as differenf neurons read ouf presynapfic 
acfivify pafferns fhaf are highly correlated, since fhey differ only by one neuron. As fhe ideal observer 
approach assumes fhaf fhe synapfic modifications are independenf, one should consider only fhe iV„ 
independenf synapses. 


The scaling is verified in simulafions in Suppl. Fig. in fhe case of fhe percepfron. Interestingly, in 
fhese neural circuifs we can also sfudy fhe abilify of fhe readouf neuron fo generalize. In fhe case of fhe 
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perceptron we trained the readout neuron to classify random input patterns and then retrieved memories 
by imposing on the input neurons degraded versions of the stored patterns. The ability to generalize can 
be expressed in terms of the minimum overlap between the input and the memory to be retrieved that can 
be tolerated (i.e. that produces the same response as the stored memory). This overlap is directly related 
to the SNR, as previously known (see e.g. Krauth and Mezard ( |1987 1), and in Suppl. Info. S.12 we show 
analytically and in simulations that in our case it scales like 1/SNR. This demonstrates the importance 
of large SNRs that are significantly above the minimum value which is required to retrieve memories 
when the cues are undegraded. Large SNRs allow for a significantly larger ability to generalize. 


Sparseness and synaptic complexity 


Sparseness is known to be important for increasing the memory capacity of neural circuits such as re¬ 


current neural networks, both for synapses that can vary in a unlimited range (Tsodyks and FeigeTman 


19881 and for bounded, bistable synapses ( |Amit and Fusi 19941. In both cases the number of memories 
that can be stored scales almost quadratically with the number Nn of neurons when the representations 
are extremely sparse (i.e. when /, the average fraction of active neurons, scales approximately as 1/Nn)- 
This is a significant improvement over the linear scaling obtained for dense representations. However, 
this capacity increase entails a reduction in the amount of information that is stored per memory. 


The synaptic model we propose can also benefit from sparsification, as discussed in Suppl. Info. S.13 


where we show that the SNR increases by a factor of up to 0(1/ ^/J) when a network similar to the one 
discussed in (Amit and Fusi| 19941 is considered. The beneficial effects of sparseness that led to this 


improvement in memory performance in (Amit and Fusi 19941 were at least threefold: the first one was 
a reduction in the noise, which occurs under the assumption that during retrieval the pattern of activity 
imposed on the network reads out only the / synapses (selected by the / Nn active neurons) that 
were potentially modified during the storage of the memory to be retrieved. The second one was the 
sparsification of the synaptic modifications, as for some learning rules it is possible to greatly reduce the 
number of synapses that are modified by the storage of each memory (the average fraction of modified 
synapses could be as low as /^). This sparsification was almost equivalent to changing the learning rate, 
or to rescaling forgetting times by a factor of 1//^. The third one was a reduction in the correlations 
between different synapses, which is also discussed in the next subsection. 


Our model can also benefit from these effects and from others that are discussed in Suppl. Info. S.13 
To quantify the memory improvement we need to specify how the synapses are modified and then read 
out. In Suppl. Info. S.13 we present the analysis of two different learning rules, and show that when 
/ ~ 1/Nn, the memory capacity scales approximately quadratically with Nn, as in (Amit and Fusi 


19941, but with an initial SNR that is 0{Nn) times larger for our proposed model. 


It is important to note, however, that the sparseness / has to scale with the number of neurons of the 
circuit in order to achieve a superlinear scaling of the capacity. While / ~ 1/Nn may be a reasonable 
assumption which is compatible with electrophysiological data when Nn is the number of neurons of the 
local circuits, this is no longer true when we consider neural circuits of a significantly larger size Nn- 
Moreover, sparseness can also be beneficial in terms of generalization, but only if / is not too small (see 


e.g. Barak et ^ 20131. For these reasons, sparse representations are unlikely to be the sole solution 
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to the memory problem. Nevertheless, plausible levels of sparsity can certainly increase the number of 
memories that can be stored, and this advantage can be combined with those of synaptic complexity. 


Limitations of our theoretical framework 


Our estimates of memory capacity are based on the ideal observer approach, i.e. the point of view of an 
observer who can measure the efficacies of all synapses to retrieve memories. This is clearly unrealistic, 
as an individual neuron can read out directly only the synapses on its dendritic tree. In the brain the read¬ 
out is probably implemented by complex neural circuitry, and estimates of the strength of the memory 
trace based on the ideal observer approach provide us with an upper bound on the memory signal. We 
validated our results in two realistic local circuits, but it remains unclear how to perform the validation in 
large neural systems respecting the observed sparse connectivity and modular organization of the brain. 


The scalability of such large systems has been studied only in very specific cases (see e.g. O’kane and 
Treves[ |1992t Roudi and Lafham 20071 and is an imporfanf fufure direcfion for our work. 


A second limifafion, relafed fo fhe firsf, arises from fhe assumpfion of random and uncorrelafed synapfic 
modificafions. Alfhough if is reasonable fo assume fhaf fhe brain processes informafion fo be stored such 
as fo memorize only whaf is nof already in memory, if is known fhaf synapfic modificafions are correlafed. 


even when memories are nof (see e.g. Amif and Fusi 1994 Savin el al. 20141. For example, synapses 
on the same dendritic tree share a postsynaptic neuron, and for this reason their efficacies are correlafed 
by many learning rules. Forfunafely, fhe disruptive effecls of fhese correlafions seem fo disappear when 
neural represenfafions are sparse (|Amif and Fusi[ 19941, as we have seen for a specific neural circuif in 


Suppl. Info. S. 13 Sparsificafion of fhe neural represenfafions is nof fhe only way to decorrelafe synapfic 
modificafions. The iniliafion of long ferm synapfic modificafions lypically requires the coincidence of 
relatively rare events. It is not unreasonable to think that these mechanisms can also greatly contribute 
to the decorrelation of synaptic modifications. If this is the case, the theoretical framework that we 
developed will be applicable to a large number of memory systems. 
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Experimental Procedures and Methods 


M.l Formal definition of memory signal and noise 


We assume that memories are stored through synaptic modification, with each new memory being en¬ 
coded in a change in the efficacies of (a subsef of) fhe synapses of a neural nefwork. To formalize fhis 
problem we will represenf each memory as a random binary paflem Awij{t) = ±1 of desired modifi- 
cafions (wifh -i-l representing pofenfiafion and -1 depression) of fhe synapfic weighfs befween neurons 
labeled j and i. We will consider fhe componenfs of Awij{t) fo be uncorrelafed (bofh across differenl 
memories and differenl synapses in a cerfain sel), as would be fhe case if a suifable preprocessing sfep 
had decorrelafed a sfream of incoming paflems for optimal compression. 

Nofe fhal we are nol considering any particular nefwork archifeclure and learning rule, buf insfead we are 
working wifh synapfic modificalions direclly, Ihus sidesfepping fhe learning rule fhal would defermine 
fhem from fhe acfivifies of pre- and posfsynapfic neurons. This makes sense in fhe confexf of fhe ideal 
observer approach, where fhe underlying assumpfion is fhal all fhe information stored in fhe synapfic 
weighfs can be recovered, buf of course if musl be slressed fhal if is nof obvious a priori whelher Ihere 
exisls a nefwork archilecfure fhal can in facl read ouf fhis informalion (see also fhe Discussion secfion). 


Neverlheless, classical memory models support fhe nolion fhal fhe ideal observer approach correclly cap- 
lures fhe scaling behavior of fhe achievable memory performance. For example, in fhe slandard Hopfield 
model ( Hopfield] 1 1982 1 fhe desirable modificalions for a sel of synapses fhal share a posfsynapfic neuron 
would be uncorrelafed (as assumed above) and a simple signal lo noise analysis using fhe ideal observer 
approach correclly predicls a memory lifelime fhal scales linearly wifh fhe number of neurons. 


If we index fhe sel of N synapses under consideration by a (insfead of i and j), fhe signal relevanl for fhe 
relrieval of a particular memory fhal was stored al lime t' is given by fhe overlap befween fhe pattern of 
fhe associaled (desirable) synapfic modificalions Awa and fhe currenl slate of fhe synapfic weighfs Wa af 
lime t\ 

1 ^ 

St'{t) = —(^'^Wa{t) AWa{t')) . (2) 

a=l 


Here angle brackels indicate an average over fhe ensemble of random uncorrelafed patterns fhal form fhe 
sequence of memories impinging on fhis sel of synapses, and we have assumed for simplicily fhal fhe 
expeclafion value of Awa vanishes (i.e. fhe inpuls are balanced), ofherwise a ferm proporlional to fhis 
expeclafion value would have lo be sublracled from fhe above. 

Similarly, fhe corresponding (squared) noise term, again for fhe pattern sfored al lime t' , is given by fhe 
variance of fhis overlap 



The quofienl of fhe signal and ils slandard deviafion, fhe signal to noise ratio, is fhe key quanlily to 
consider when assessing fhe possibility and fidelity of recall of a previously sfored memory. While we 
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have considered a particular pattern stored at time t', we will assume that all memories are initially 
encoded with the same strength (though it is easy to generalize to a distribution of initial strengths), so 
that there is nothing special about any one memory. In this context, if the distribution of the synaptic 
weights reaches a steady state (as it does in the cases we are interested in), the signal to noise ratio 
really only depends on the time t — t' elapsed since storing the memory in question (i.e. the age of the 
memory). Accordingly, we will write it simply as a function of this time difference, which for a wide 
range of models will be monotonically decreasing. 

A good memory system is one that has a large initial signal to noise ratio, such that recent memories 
can easily be retrieved (using only a small, i.e. potentially highly corrupted, cue), and a long memory 
lifetime. The latter is defined as the time elapsed until the signal to noise ratio drops below a certain 
retrieval threshold, the minimum value of S/M at which recall is still possible. The precise value of this 
threshold will depend on the details of the network architecture and the retrieval dynamics, but as long as 
it is of order unity this will not affect the scaling results derived below, and thus in what follows we will 
simply set it to one unless otherwise noted. If the rate of memory storage is constant, the memory lifetime 
is proportional to the capacity of the system, i.e. the total number of memories that can be recalled at a 
given time. The tradeoff between the two goals of large initial signal and long memory lifetime will be 
discussed in detail below, and will eventually lead us to optimizing an appropriately defined area under 
the signal to noise curve that captures the joint target of having as large a signal to noise ratio as possible 
for as long as possible. 


Desiderata for a useful synaptic memory model Our aim here is to build a model of long-term 
memory that exhibits a number of properties we consider essential. We would like our model synapse 
to be able to learn online (one pattern at a time), and forget gradually and smoothly (without phase 
transition such as the catastrophic forgetting in standard Hopfield-type models, see e.g. |Amitl[T98^ . In 
addition to exhibiting a large initial signal to noise ratio and long memory lifetime, the synaptic weights 
should reach a steady state distribution (given constant input statistics) that has support in only a small 
range of values (i.e. that does not allow for arbitrarily large weights, or equivalently, weights in a finite 
range that must be read out with arbitrarily high precision). Note that one can easily obtain a model with 
bounded synaptic weights by restricting (hard-limiting) the range of a standard unbounded synap se (with 


plasticity events of unit magnitude) to values of order \/N (Paris! 


1986 


Fusi and Abbott 


20071, which 


is still an unrealistically large number. We will consider much more tightly bounded synaptic weights. 
Finally, all this has to be achieved while keeping the complexity of the model rather small, i.e. avoiding 
overly baroque internal mechanisms involving too many variables. 


M.2 Abstract models with linear superpositions of memories 

Basic assumptions In order to build an efficient synapse with bounded weights we will start by consid¬ 
ering a continuous synaptic variable with an additive plasticity rule and a time-dependent kernel r{t — t'), 
which we take to be the same for all synapses and plasticity events (i.e. across all stored patterns): 

(*) = X] (^0 - ^0 • (3) 

t'<t 
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By additive plasticity rule we simply mean that the efficacy w\] is a weighted sum over past plasticity 
events, which we take to be of fixed magnitude Awij{t) = ±1 (with a plus sign for potentiation and 
minus for depression). The Awij{t) may be computed from the neural activations corresponding to 
the patterns we want to store. For example, they could be determined according to a covariance rule 
Awij{t) oc (.^i — — (Cj)), where the = ±1 are binary with equal probability for both values 

(such that (^i) = 0). Recall could be achieved by the network dynamics (of an auto-associative, Hopfield- 
type network, see Hopfield 19821 that completes the stored pattern of neural activations from a partial 
(or corrupted) cue 


However, we deliberately divorce our analysis from the choice of learning rule and the network dynamics, 
by focusing on a subset of synapses that receive statistically independent inputs and taking an ideal 
observer approach. Successful retrieval of a previously stored memory then requires the signal to noise 
ratio of this set of synapses to be larger than a certain threshold (which we will set to one). 

We are assuming that potentiation and depression events are equally likeljQ and uncorrelated between 
different synapses and memories. In other words, we consider storing random patterns of synaptic mod¬ 
ifications in which each bit of each memory can be thought of as determined independently by the flip of 
an unbiased coin. 


Signal to noise ratio We have introduced a time-dependent kernel r{t — t') above since otherwise 
the synaptic weight would grow without bound as more and more patterns are stored. This can avoided, 
however, if r{t — t') decays sufficiently fast as a function of the age of the corresponding memory (i.e. the 
time elapsed since storage). 

Following the definition of eqn. <0), the signal (at time t) associated with a particular memory is given 
by the overlap of the corresponding pattern of synaptic modifications (stored at time F) with the current 
synaptic weights, which using the ansatz Q leads to 

Await')'^ =r{t-t') , 

a=l 

where the neuronal indices i and j have now been replaced by a single synaptic index a, ranging over the 
set of synapses under consideration. Combining this with the corresponding noise term, we obtain the 
signal to noise ratio 


S/N{t -1') 


Nr‘^{t — t') 




(4) 


It will be convenient in what follows to approximate the sum in the denominator by an integral over the 
full range of past t" values (see also Section S.5 for details), neglecting the small correction that arises 


'if this was not the case a homeostatic mechanism would be needed to adjust the relative magnitude of these types of 
plasticity events in order to achieve a steady state without introducing any explicit bounds on the synaptic variables wtj. More 
generally one could imagine a distribution over magnitudes of plasticity events, and again the existence of an equilibrium 
without explicit bounds on the weights would require a balance condition, namely that the expectation value of the initial size 
of plasticity events vanishes. Another conceivable generalization would be to introduce different kernels for potentiation and 
depression events. 
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from the fact that there is a term corresponding to t” = t' missing in the sum (since this term is the signal, 
rather than part of the noise). The noise will then be represented by an integral of the form r^(t) dt, 
and thus if the decay kernel is a power law r(t) = is it clear that we must have 7 > 1/2 or else this 
integral will not converge. Crucially, the divergence of this noise integral also implies that the variance of 
the synaptic weight would blow up, so that even if we regularized the integral appropriately for 7 < 1 / 2 , 
the resulting range of synaptic efficacies would be large. Therefore, the slowest power-law decay we can 
afford is r{t) = which is the critical case in which the synaptic variance just starts to diverge (see 

also Suppl. Info.|S.l|). 


M.3 Constructing models by coarse-graining random walks 


Here we describe the procedure for building a model of a complex synapse that implements the required 
forgetting curve (1 / \/t) in a natural and parsimonious fashion. We will begin with general considerations 
of random walks and diffusion processes, and then refine as well as generalize the model step by step, 
throughout Sections M.5|and M .6 


The present section serves primarily to provide a more systematic background for the model construction 
steps leading from Fig. to Fig. [^, and furnish some mathematical details. Reducing the analogy of 
fluid flowing through a system of communication vessel to its most basic ingredients, we will consider a 
random walk of particles (which can be thought of as the molecules in the liquid) along a chain of discrete 
sites (which correspond to the beakers). Even though more abstract and general, this construction is 
equivalent to that of the main text in the particular case discussed there. 


See also Section M.4 for an alternative point of view using the (approximately equivalent) language of 
diffusion processes, which leads to a particularly simple description of the proposed synaptic dynamics 
that allows for analytical derivations of a number of important properties of the model. 


Linear chain models Consider a random walk of particles on a semi-infinite chain in discrete time 
steps. We denote the number of particles at location j at time t by Vj{t) for j = 1... 00 . (Note that this 
number can be negative; we can think of the particle number as being measured relative to a constant 
background density.) At every time step each particle has a finite probability of moving one step to the 
left or to the right. This probability is the same for both directions and for all locations except j = 1, 
which has no left neighbor. It is easy to see that for such a stochastic process the time derivative of the 
particle numbers is equal to a discrete Laplacian: dvj/dt oc vj-i — 2vj + Vj+i for j > 1. In other words, 
we have a spatially discretized diffusion process with constant diffusivity (see top panel of Fig.|Ml|). 


In order to make contact with systems of exponentially varying diffusivities that we are interested in, we 
will now consider discretizing the above random walk even further, on a coarser scale. We introduce a 
new set of coarse-grained variables Ui which are located at positions j = 2 *“^ on top of ^ 2 ^- 1 , i-C. they 
are exponentially spaced. Our goal is to find an effective, approximate description of the system in terms 
of the u variables alone, where we think of each Ui as reflecting the average behavior of the system in 
the interval between its own location and that of its right neighbor Uj+i. 


We can achieve this by assuming that the particle density profile is piecewise linear, with kinks located 
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only at positions j = 2*“^, such that all the curvature (which drives diffusion) is concentrated there. We 
can then use simple linear interpolation to eliminate all the Vj from the equations of motion except those 
that coincide with the Ui. This would lead to the following expressions: dv 2 i-i/dt oc {v 2 i -2 — 
V 2 i-i) — — V 2 i) for i = 2, 3, 4 ..while the time derivatives of the other vj (for which j is 

not a power of two) would vanish, since they are situated in regions of linear particle density. 

However, for the piecewise linear approximation to be self-consistent (i.e. still applicable at the next time 
step) changing the particle number at the end of a line segment must be accompanied by an appropriate 
change everywhere along the segment to maintain linearity. In other words, the time derivative of the 
endpoint V 2 i-i must be distributed among all variables along the line segment. Thus if our effective 
variable Ui is proportional to V 2 i-i, its time derivative has to be proportional to the average derivative 
along the line segment to its righj^ 

There are 2*“^ variables on this line segment and denoting the constant of proportionality by q;/ 2 this 
leads to dui/dt = {ui-i — Ui) — {ui — Ui+i), which describes a discretized diffusion 

process on a logarithmic scale (i.e. as if viewed on a plot in which the spatial axis is logarithmic). 

In such a random walk model a plasticity event would correspond to adding or removing a particle from 
the leftmost location, which modifies the equation for z = 1. If we denote this time-dependent input of 
unit magnitude (and sign discriminating potentiation/depression) by X we find dui /dt = X—2~^a {ui — 
U 2 )- Similarly, if fhe chain is nol semi-infinile fhe equafion for fhe lasf (mfh) variable will only confain 
a coupling fo ifs (sole) lefl-hand neighbor, buf we can add a leak (exponenfial decay) ferm fo if fo render 
fhe variances of all parficle numbers finife. This is easily achieved by simply selling fhe value of fhe 
(non-exislenl) righf hand neighbor fo zero, such lhal dum/dt = {um-i — Um) — Um- 


Model with different ratios of timescales While above and in the main text we have chosen pa¬ 
rameters that vary as powers of two for ease of illustration, this can easily be generalized to arbitrary 
exponents 

^ (ui-i - Ui) - (u* - Uj+i) , (5) 

which still approximates the desired behavior of the Green’s function for arbitrary real-valued 
n > 1, with ratios of successive timescales of 0{v?). The tradeoff in the choice of n is that for large 
n this approximation is not very good (since a superposition of a small number of exponentials leads to 
a rather bumpy surrogate for a power law), while for n only slightly bigger than unity a large number 
m of variables are needed to cover a given range of timescales, say between one and T, namely m ~ 
log r/(21ogn). 

Note that even within the space of linear (and first order in time) equations with nearest neighbor inter¬ 
actions on a chain we could generalize eqn. Q even further by introducing different ratios of succes¬ 
sive timescales instead of just one global parameter re, while still approximating the inverse square root 
Green’s function. 

^The details of this coarse-graining procedure are a matter choice, of course. The mathematically inclined reader will notice 
that it would be more appealing to have a symmetric prescription in which we average over the left and right line segments, 
or alternatively to think of the Ui variables as living in the middle of a line segment. This would merely change the overall 
timescale of variation that is unimportant here, so for ease of illustration we stick with a one-sided prescription. 
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M.4 Continuum space limit and diffusion equation 


In the preceding section we discussed a set of first order differential equations describing a random walk 
of a large number of particles (or equivalently the flow of water between connected beakers). In this 
construction space was discrete from the beginning (represented by a number of sites or beakers), but we 
could have chosen instead to step back even further and start from a model in which space is continuous. 
This even simpler model, which is highly instructive and allows for an intuitive explanation of important 
properties such as the l/\/f decay, connects the proposed synaptic dynamics to heat diffusion on a line 
(e.g. along a thermally insulated wire). 

Consider the one-dimensional diffusion equation (with u{x', t) interpreted as the temperature profile 
along a homogeneous rod) 


du d^u 

dt dx''^ 

Its Green’s function for a (5-function input (of one unit of heat energy) at time t = 0 


( 6 ) 


Gu{x\t) 


1 

\/ AnDt 


e iDt 


(V) 


decays as l/\/f at the origin (i.e. at x' = 0, where the (5-function is localized). Thus if we represent 
the input to the system by such an instantaneous pulse, the correct decay of the signal is already built 
in, as long as we read out the synaptic weight at x' = 0. Since the equation is linear, we can simply 
superimpose Green’s functions for a sequence of such inputs (positive for potentiation and negative for 
depression), and they will behave as required by eqn. Q. 

Even though the Green’s function we wrote here is for an infinite line, it is symmetric around the origin, 
and thus we can simply fold the system in half (leading to a Neumann boundary condition) and use the 
same Green’s function (up to a factor of two) on the semi-infinite line. A (5-function input at the origin 
will then evolve into a half Gaussian bump that will gradually spread out (the peak remaining at the 
origin) with a standard deviation that grows in proportion to y/t. 

To revert back to the system of communicating beakers described above, we simply have to spatially 
discretize this diffusion process by chopping up the rod into finite chunks and considering the resulting 
interactions of the average temperatures of those chunks. The piece closest to the origin corresponds 
to the synaptic weight, while the other ones give rise to the hidden variables. If all those chunks have 
the same (say unit) size, this will lead to the system shown in Fig. 1C. While it has the correct decay 
behavior, the system cannot be of infinite extent. There will be some finite number m of separate chunks, 
and when the width of the Gaussian bump becomes comparable to the total size of the system, the l/\/i 
decay of the Green’s function Q, which assumes an infinite system, will break down. In other words, if 
there is a second boundary, we have to choose a boundary condition there, which will modify the power 
law decay on a timescale T ~ m^. Thus if want to achieve an extensive memory lifetime T ~ N, the 
number of variables that would be required is m ~ \/iV, which is unrealistically large. 

Note that we have assumed that the system is purely diffusive and free of any drift term. If that was 
not the case, the situation would be even worse, since the peak of the Green’s function would move at a 
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finite velocity and hit the second boundary at a time T ~ m, so that we would need even more variables 
(m ~ N) to obtain an extensive memory lifetime. 

Fortunately, drastically reducing the number of required variables while maintaining a close approxima¬ 
tion to power-law decay is not difficult. Recall that the (thermal) diffusivity D in eqn. Q in general is 
a ratio of a thermal conductivity g{x) and a heat capacity C{x), and that those can be spatially varying, 
which leads to the more general diffusion equation 

du 1 d f du\ 

1h ^ c{^ ^ ^ ) ■ 

If we break the homogeneity of the system by introducing exponentially varying parameters C{x) ~ 
and g{x) ~ we obtain the differential equation 


du D d f du'\ 

dt (5‘^eP^ dx \ y ’ 

parameterized by positive constants D and j3, which has a Green’s function given by 


Gu{x,t) 


1 (e/3^-l)2 

, e iDt 

\J 4:7TDt 


( 8 ) 


It describes a signal (in the form of a temperature difference) that propagates only very slowly towards 
larger x. This is because the thermal conductivity decreases exponentially while the heat capacity in¬ 
creases with X, and thus an input given by a certain amount of heat energy at x = 0 will lead to a no¬ 
ticeable temperature difference at finite x only at exponentially large times, when t rsj (e/3^ - if/{AD). 
Therefore, to reach an extensive memory lifetime the largest value of x we need to consider, which is 
proportional to m, will now only scale as log N. 

Throughout this diffusion process, the heat energy Q = f dx C{x) u{x, t) is a conserved quantity, mod¬ 
ulo a leakage term potentially introduced by the second boundary condition at x ~ m. Spatially dis¬ 
cretizing this system as above leads to the model of communicating vessels shown in Fig. IE, which 
achieves the correct power-law decay and extensive memory lifetime with only a logarithmic number of 
variables. 


Note that the two continuum models Q and ([^ we have discussed in this section are in fact equivalent 
under the change of variables x' = — 1. This implies that there is another way of arriving at 

the simple linear chain model we want. We can start from a homogeneous diffusion process (constant 
diffusivity), but instead of discretizing space on a linear scale (into chunks of equal length) we can 
discretize on a logarithmic scale (i.e. divide the system into chunks of exponentially increasing size). 
This is precisely in the spirit in which we have described the transition from the homogeneous random 
walk (communicating vessels of constant size) to the desired linear chain model in Fig. 1 and Section 


M.3 Both are approximations to a simple one-dimensional diffusion process, but spatially discretized in 


different ways, with the latter being much more efficient in terms of the number of variables needed. 
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M.5 Detailed description of the linear chain model in discrete time and with quantized 
variahles 


While above we have written equations for a continuous time system, it is a simple matter to discretize 
time, as is appropriate for a incoming stream of temporally discrete patterns representing different expe¬ 
riences to be stored. We will choose one time step to correspond to one such memory and write 

Ui{t + l) = 'Uj(f)-h - rtj(f)) - (ui(f) - Mi+i(f)) . (9) 

Again, the last equation (for z = m) is obtained from this by simply setting Um+i = 0 for all times (thus 
introducing an exponential decay term on very long timescales), while the first equation (for z = 1) is 
modified by introducing the binary input of unit magnitude T{t)\ 

ui{t + 1) = ui{t) + I{t) - n~^a (zzi(f) - ■ (10) 

The value of a is a free parameter in these equations that determines the overall timescale of the dynamics 
(but should be chosen small enough such that the transition matrix on the left hand side of these equations 
has no negative eigenvalues, which could lead to oscillations). We will take a = 1/4 below and in all 
numerical experiments. 

Having discretized time we are now left with a model of a complex synapse consisting of a small number 
m of coupled variables operating in discontinuous time steps (one step per incoming memory) according 
to the deterministic (given T(f)) dynamical eqns. (j^ and ( [l^ . However, the values of these variables 
are still continuous, and in the next step we will discretize those as well, thus turning the model into a 
Markov chain (with inputs given by the random patterns to be stored and stochastic transition dynamics 
for the Ui). 


In order to achieve this discretization, we will simply declare that every variable can only take one of a 
finite number of values (which we will refer to as levels) at every time step. We assume that these levels 
are integer-spaced and distributed symmetrically around zero (such that for an odd number of levels the 
allowed values are integers, while for an even number they are odd multiples of one half), though the 
algorithm described below can easily be generalized to arbitrary choices of discrete levels. 


For every time step, we first compute the right hand sides of eqns. Q and (101, with the Ui{t) from 
the previous time step entering as (half) integers (and similarly the input X{t) as ±1). If the resulting 
Ui{t + 1) happens to coincide with one of the quantization levels there is nothing further to be done, but 
in general the result will fall in between two levels, and in that case we have to decide which of the two 
neighboring levels will be the new state of that variable. This can be done by independently flipping a 
biased coin for each such decision, wifh fhe odds rafio of fhe coin (corresponding fo one or fhe ofher 
level being chosen) equal fo fhe inverse rafio of fhe disfances from fhe desired (non-quanfized) value fo 
fhe respecfive levels, such fhaf fhe closer one of fhe neighboring levels will be more likely. 


The number of levels for each variable is finife, and if fhe righf hand sides of eqns. ([^ and ( [T0| lead fo a 
desired updafe for any variable Ui fhaf would cause if fo become larger fhan fhe value of ifs highesf level 
we sef if fo fhis level wifh probabilify one (and similarly for fhe lower end of ifs dynamical range). 


This is fhe fully discrefized, sfochasfic model fhaf we use for simulafions, in particular fhose shown in 
Figs. m and If should be sfressed, however, fhaf fhe quanfizafion of fhe variables is neifher 
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necessary for the model to work, nor required for a plausible biophysical implementation. In fact the 
signal to noise ratio will be somewhat higher without the additional noise introduced by the stochasticity 
of the random choices between nearby levels (though the scaling behavior appears to be the same). 
However, even though we don’t need stable, discrete levels, we do perform this quantization in order to 
emphasize the fact that the variables never need to be kept track of with high precision (as long as there 
is no systematic drift) and that there is no biologically implausible information hidden in exactly read 
out continuous variables. 


M.6 Models on more general graphs 


Using the procedure of coarse-graining a random walk with exponentially spaced variables described in 
Section M.3[ we can construct a broad family of models that are equivalent to the above nearest neighbor 
chain models, but whose topology is that of an arbitrarily complex graph instead of a one-dimensional 
chain. 


Let us illustrate this with a simple example, a graph containing a single loop (see Fig. Ml i. We start 
again from a homogeneous chain of Vj variables, which will be coarse-grained over intervals of expo¬ 
nentially increasing length £ (growing as powers of two) to obtain the effective Ui variables that appear 
in our model. However, before coarse-graining, we split the chain in two and assign arbitrary impor¬ 
tance weights (jj and I — p, with 0 < p < 1) to the two branches. Since the dynamical equations are 
linear, this can be done in such a way that in their influence on the input/output variable ui (whose be¬ 
havior directly determines the synaptic efficacy) the combined dynamics of these two branches remains 
indistinguishable from that of a single branch of unit weight, regardless of the value of p. 


In the random walk picture of the resulting model, when a particle takes a step to the right from U 2 , 
it chooses the lower branch with probability p and the upper one with probability 1 — p. This means 
that compared to the original single chain the couplings of U 2 to its right-hand neighbors tt 3 and U 4 
are effectively multiplied by these probability factors (the connecting tubes are narrower, but their sizes 
add up to that of the original tube connecting U 2 and U 3 in the single chain). On the other hand, the 
capacities for and U 4 to absorb particles are also multiplied by these weight factors (the sizes of the 
corresponding beakers are reduced to maintain the same overall capacity), such that all factors of p or 
I — p cancel in their equations of motion. When the two branches recombine at u^, their importance 
weights add to unity, and while the couplings to the left still carry the branching factors, the capacity of 
Ms does not (since this beaker is the same is in the original chain). This leads to the set of equations 


dM2 

dt 

dU3 

dt 

dM 4 

dt 

dU5 

dt 


a , , ap. . a{l — p), , 

-(mi - M2) -h —(M3 - M2) d-^-(M4 - M2) , 

^(m 2 - M3) -h ^(m5 - M3) , 

^(n2-M4) + ^(M5-M4), 

ap, , a (1 —p) , . a , . 

- ''^ 5 ) +- ^ - (^^4 - U 5 ) + - U 5 ) . 


while the equation for mi will be the same as in eqn. ( [T 0 | ) for n = 2 . Note that e.g. the equation for M 3 
is the same as for the single chain according to eqn. Q, and if the two chains didn’t merge again at M 5 
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Figure Ml: Constructing a model with the topology of a simple graph with a single loop. From top 
to bottom: A simple, unbiased random walk on a homogeneous, semi-infinite chain, where each circle 
denotes a vj variable representing the local particle density. A part of the chain is divided into two parallel 
branches, with assigned importance weights that add to one (determined by the branching probability 
p). Coarse-graining using intervals of exponentially increasing length as in Fig. [T] and Section M.3| 
(i.e. merging variables within regions outlined in red and rescaling coupling constants between them) 
leads a simple model in terms of the Ui variables, in which ui has the same dynamics as in the single 
chain model. 


similarly all variables to the right of U 2 would obey the same equations as the corresponding ones in the 
original model. It is easy to see that pu^ + {1 — p) is the only combination of the branched variables 
that affects their neighbors, and it is this combination that also appears in the conserved quantity of the 
system. 

We can easily generalize this procedure to construct more complex graphs with multiple (perhaps par¬ 
tially or fully recombining) branches that are still completely equivalent to a linear chain for any choice 
of branching probabilities as long as corresponding variables on parallel branches (vertically aligned in 
the diagrams) are coarse-grained using intervals of the same length. 

Even more generally, we can construct a large class of models that are only approximately equivalent to 
a linear chain, by choosing different discretization intervals on parallel branches, and choosing different 
patterns of separating and recombining branches that are not commensurate with exponentially increas¬ 
ing spacing. This will still give an efficient approximation to the desired behavior for ui as long as the 
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mean interval size (e.g. taking a weighted average over parallel branches) increases roughly exponentially 
going from left to right. 


Let us now describe an algorithm for choosing appropriate parameters for such a model (which subsumes 
the simple example discussed above). We are given an undirected graph with junctions at the locations of 
(some of) the Ui and a set of branching probabilities pi for each dividing junction (in general there may 
be more than two branches emerging at one junction, in which case one could use probabilities pik for 
the branch connecting Ui to Uk)- As above, we can formalize the problem in terms of coupling constants 
Pik between variables and their neighbors, and capacities Ci, which need to be chosen s uch as to endow 
the system with the correct (~ 1/Vi) Green’s function for the first variable (see Fig. M2i. The right 
hand side of the equations of motion will take the form of a weighted Laplacian on the graph: 


duj 

dt 


1 

a 


Ai adjacent to 2 


( 11 ) 


where pik = Pki = aiOik/iVliV) and Ci = Erightneighbors fc(summing over edges to adjacent 
variables Uk to the right of Ui, unlike in eqn. ( [TT] ) where the sum is over all adjacent variables). Here Ck 
is the length of the interval (the number of steps in the random walk construction of Fig. Ml| ) between 
neighboring variables Ui and Uk, while ojik is the importance weight of the branch connecting them. 
The branch weights ujik are computed by traversing the graph from the input to the right, multiplying 
by the appropriate branching probability p whenever a dividing junction is encountered, and adding 
the corresponding weights when two branches recombine (such that the ujik always add to one along any 
vertical cut through the graph). To illustrate this simple prescription, we have spelled out in the schematic 
of Fig. [M^ the capacities and coupling constants for the case of the complex model of Fig. [^. 

As above, the equation for the first variable ui will be modified by the addition of the input term, and 
the equations for the rightmost variables by the addition of exponential decay terms (couplings to the 
reservoir). Note that if there are multiple branches extending all the way to the right, appropriate leakage 
terms should be added to all of them. 

One should bear in mind that we have set n = 2 in this section only for illustrative purposes, to provide 
an intuitive connection to integer-spaced random walks, but the construction of this family of models on 
graphs can be extended to arbitrary real-valued n without difficulty. The distances iik that determine the 
parameters Ci and pik don’t have to be integers. They would be proportional to n* in the linear chain 
model (for which the right neighbors always have k = i + 1 and the importance weights uJik along the 
sole branch are all equal to one), such that eqn. ([TT) agrees with eqn. d5)up to a trivial rescaling of 
the overall timescale l/a. For a more general graph laid out as in Fig. Ml we can take £ik to be the 
horizontal distance between adjacent variables, assuring that the total distance between two given nodes 
is the same along any branch connecting them. 
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Figure M2: Diagram of a more complicated graph model with multiple partially recombining branches, 
corresponding to the network of beakers shown in Fig. |^. We label the Ui variables from left to right 
and bottom to top, with branching probabilities pi carrying the index of the associated Ui, and qi defined 
as 1—pi. The capacities Ci are shown above the corresponding variables, and the coupling constants gik 
are given in red next to the arrows connecting pairs of interacting variables (or variables to the reservoir). 
Using these parameters in eqn. the dynamics of ui is again equivalent to that in a single chain 
model. 
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Supplemental Information 


S.l Linear superpositions of memories: optimal decays 


Using the abstract model of Section M.2 we can consider optimizing (certain objective functionals de¬ 
pending on) the signal to noise ratio of eqn. (Q over all possible decay functions r{t). This can be 
achieved using an integral approximation and variational arguments as follows. 


Maximizing the signal to noise ratio We want to maximize the doubly logarithmic area A{T) under 
the signal to noise curve (normalized here for a single synapse) 


AogT 

A{T) = / d(logt) log 
do 


r{t) 


y/fi dt'r(t') 2 _ 


f^dt logT 

ylogr(f)- 




on a fixed time interval 1 < f < T, under the assumption that at time T the signal to noise ratio will 
still be above (or at) the retrieval threshold (since we really only care about the area above the threshold, 
and without this assumption we would have to take the positive part of the logarithm of S/M over the 
threshold before integrating). A simple variational argument proceeds by perturbing r{t) by an arbitrary 
function of bounded range 77 (f) with a small coefficient e, i.e. r(f) —)> r{t) + and equating to zero 

the variation 


5A{T) 

5e 


dt r]{t) log T dt 2 r{t)r]{t) 


t r{t) 


dt'r(t')^ 


= / df 77 (f) 


1 


logT r(f) 
i^(t) f^dt'r(t')^ 


= 0 , 


Since this has to hold for all 77 (f) it implies r(f) = dt' r(t')^ j \/f logT and thus (noting that the 
integral on the right hand side is simply a number, which has to be finite for the solution to be valid) we 
have r(f) ~ _ 


Alternative optimization using Lagrange multipliers A slightly more standard approach to opti¬ 
mizing our objective, which ultimately leads to the same result, but avoids nested integrals, is to use a 
Lagrange multiplier A to keep the noise integral fixed, while maximizing the (logarithm of the) signal. 
The modified utility functional reads 


A{T) 



df 

f 


log r(f) 




( 12 ) 


Using the same variational expansion r(f) —)• r(f) -|- e 77 (f) we find that 


bA{T) 

be 


df 77 (f) 
f r(f) 


— 2xJ dtr{t)rj{t) = J df 77 (f) 


tr{t) 


— 2 Ar(f) 


= 0 r(f) = 




The value of the Lagrange multiplier is fixed by df r(f )^ = log T/(2A). 
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Scaling behavior of the optimal solution For an inverse square root time dependence of the memory 
trace, the resulting signal to noise ratio for N synapses behaves as S/J\f{t) = ^Nj {t log T), which 
means that the memory lifetime t* at which it drops to one is = N/ log T. In particular, if we choose 
T = N, we would obtain a memory lifetime t* = N/ log N, which is however not quite consistent 
with our assumptions, since t* < T, i.e. the signal to noise ratio drops below the threshold before the 
cutoff time. On the other hand, choosing T = N/\ogN leads to t* = N/{\ogN — log log A^) = 
N/\ogN + A/^(loglogiV)/(log + ..., which is consistent as we now have t* > T. The optimal 
choice for T, where it is equal to t*, is determined by the equation T = N/ log T, which does not have 
a solution in terms of elementary functions. However, we can expand the resulting t* for large N and 
find that the first two terms are the same as above, and thus we conclude that we can achieve a memory 
lifetime t* = 0{N/log N). The coefficient in front of N/ log N depends on the value of the retrieval 
threshold of course (and is one for our arbitrary choice of unit threshold). The corresponding initial 
signal to noise ratio scales as S/M{t = 1) = 0 {\/N~[\ogN). 


Alternative regularizers We have optimized the area under the signal to noise curve in a finite interval 
1 < f < T, which may not appear to be exactly what we want, since it involves choosing a cutoff T (and 
for a specific application it may not be clear to begin with what T should be). However, since the optimal 
solution r{t) ~ is independent of T, we are free to just set it a posteriori to any value smaller than 
or ideally equal to the memory lifetime. 

Nevertheless, recalling that we want to construct a time-invariant system, we should really be considering 
the area above the threshold on an semi-infinite interval f > 1, in which case there is an issue with naively 
extending the solution r(t) ~ to infinitly large t. The cumulative noise term = f dtr(t)^ now 
receives contributions from all previously stored memories (even arbitrarily old ones whose signal to 
noise ratio has dropped below threshold) and thus diverges logarithmically as the range of integration 
is extended to infinity, which means that the solution needs to be regularized to be viable. Of course 
everything would be perfectly consistent if we simply declared that r(t) decayed as the inverse square 
root of t until time T, then instantaneously dropped to zero and remained there for all times larger than 
T (i.e. r{t) = 0 for f > T). This is what we have effectively assumed in our derivation above when 
we set the integration range of the noise term to extend up to the cutoff time T only. However, such 
discontinuous behavior does not seem very natural, which is why in what follows we will be considering 
other regulators that allow us to smoothly extend r{t) as t ^ oo. 


One simple regulator involves modifying our solution by an exponential with a long timescale r (com¬ 
parable to T above), such that r{t) = exp(—^) decays rapidly for t ^ r. The resulting signal 
to noise ratio, after performing the sum in eqn. Q, is given by 


S/Af{t) 


N 


N 


—t e*/’’ log(l — 6“^/"^) — 1 


f logr 


Alternatively, we could modify the power law to decay just slightly faster than the inverse square root of 
time, such that r{t) = and therefore 


S/Af{t) 


N 


£—^0 


fi+H(l + e)-l 
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where ( is the Riemann zeta function. In both cases, upon setting t = N and e = 1 /log N, respectively, 
we find that the leading order scaling behavior remains 0 {\/WJ\o^^) for the initial signal to noise 
ratio, and 0{N/ log N) for the memory lifetime. 


Different choices of objective functionals One might wonder why we choose to optimize the area 
under the log-log plot of the signal to noise ratio, rather than some other functional of the signal to noise 
curve. The intuition behind this is clear: while we want to have a big S/Af to be able to retrieve a 
memory from only a small (highly corrupted) cue, we do not want to spend all our resources making 
an already large signal to noise ratio even larger, and thus we discount very large values by taking a 
logarithm. Similarly, while we want to achieve long memory lifetimes, we do not focus exclusively 
on this at the expense of severely diminishing S/J\f, and therefore we also discount very long memory 
lifetimes by a taking a logarithm]^ While putting less emphasis on extremely large signal to noise ratios 
and extremely long memory lifetimes is very plausible, the use of the logarithm as a discounting function 
is of course simply a matter of mathematical convenience, and one could certainly imagine choosing a 
different functional form of the objective to be optimized. 


Let us briefly discuss what would happen if one did not employ such a discounting procedure. First, 
one might be tempted to simply maximize the memory lifetime, i.e. not discount very long lifetimes. 
This would be appropriate if we did not care about memory retrieval from small cues, which require 
large signal to noise ratios, but we were in a (somewhat artificial) situation where we only ever needed 
to recover memories from a cue of a fixed level of accuracy, such that there would be little benefit to 
having a. S/J\f larger than some number of 0(1). In that case the objective functional, along the lines of 
eqn. would be 



dt logr(t) 




()e 



2 Xr{t) 


= 0 , 


which implies r{t) = = constant. Similarly, if we also did not discount large signal to noise 

ratios and simply optimized the area on a linear plot, we would find 


.4l"(T) = dtr(t) -X 


dtr{tf 


= dtv{t)[l-2Xr{t)]=0, 


such that r{t) = 1/(2A), which is again constant. In both cases the noise term diverges linearly with 
T, which means that an appropriate regularization is pertinent. If we simply cut off the memory trace at 
t = T, such that r{t) becomes a step function, we would find S/M{t) rsj xjN/T for 1 < t < T and zero 
otherwise!^ Alternatively, if we use an exponential regulator r{t) = exp(—^), as proposed in 


Mezard 


In particular, there would be no point in having memory lifetimes longer than the lifetime of the animal in question, which 
can also be encoded by an appropriate choice of cutoff T above. 


‘*This amounts to adding the last 'I' plasticity events with equal weights to compute the synaptic efficacies, and when 

embedded in a recurrent network (typically using a covariance learning rule) is essentially a time-translation invariant version 
of the Hopfield prescription i HopfieldH1^82jAor^^ichjlw number of storable memories scales linearly'with N. Noti” ”h^ 
in the case of the Hopfield network N is also the number of neurons. Even though in a fully connected network of N neurons 


there would be of order N‘‘ synapses, they would not all be statistically independent because different neurons receive basically 


the same input (any two neurons share N — 2 inputs). A set of statistically independent synapses would be those on the dendritic 

tree of any particular neuron (for independent inputs), and therefore no larger than the total number of neurons. 
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et al. 


(19861, eqn. fl leads to S/M{t) VWr exp(—^). In both cases choosing the cutoff timescale 


(T or r) to be 0{^ leads to a memory lifetime that is also 0{N), but at the cost of reducing the (initial) 
signal to noise ratio to 0(1). Note also that while such a model would achieve a slightly better scaling 
of the memory lifetime (by a factor of log N), the range of the synaptic we ight (which i s tied to the 


noise term) would have to grow polynomially as 0{^/N), as in the proposal of Paris! (19861, rather than 


logarithmically. This means that a discretized synapse would require a huge number of discrete levels. 


Finally, we might decide that for a particular application we really care about having a very high signal 
to noise ratio, even at the expense of a shorter memory lifetime, and thus discount only the latter by a 
logarithm, which leads to 







2 Xr(t) 


= 0 . 


In this case we hnd an inverse proportionality of the memory trace to time, r{t) = l/(2Af), which 
is appr oximately realized by the cascade and multistage models of Fusi et al. ( |2005 1; Roxin and Fusi[ 
(20131. The noise integral is convergent as T —)■ oo, such that no further regularization is needed 
(and correspondingly the required range of the synaptic variables is 0(1) only). The resulting signal to 
noise ratio behaves as S/J\f(t) ~ ^/N/t, which means that both its initial value and the memory lifetime 
are 0(^/N), i.e. the lifetime is signihcantly reduced. 


In summary, we could certainly write down other objective (utility) functionals to optimize, which would 
lead to different plasticity-rigidity tradeoffs. Our solution has the beneht of exhibiting polynomial scaling 
with the optimal exponents (up to logarithmic corrections) for both the initial signal to noise ratio and 
the memory lifetime, with a smooth power-law behavior in between, and importantly requires only a 
small dynamical range of 0(\/log N) for the synaptic weight. By modifying the objective, as discussed 
above, or deforming the solution (e.g. consider the power-law regulator with hnite e), one could 
either increase the initial signal to noise ratio from 0 (^yWl]oglV) to 0 {^/N) at the expense of a shorter 
memory lifetime, or improve the memory lifetime from 0(N/ log N) to 0(N) while paying a price in 
terms of the initial memory strength. 


S.2 Variance and spatial correlations of the dynamical variables 


We can use the continuum diffusion model of Section M.4 to obtain some insights on correlations of 
the dynamical variables. Recall that in the (spatial) continuum limit of our model, each variable cor¬ 
responds to an interval on the one-dimensional space along which the synaptic inputs diffuse (here we 
are essentially undoing the coarse graining that turns a single partial differential equation for u(x',t) 
into a set of coupled differential equation for the Ui(t) variables). The impulse response of the model 
on an inhnite line is described by the Green’s function of eqn. 0- and restricting to non-negative x' by 
introducing a boundary at the origin merely modihes this by a factor of two (the Neumann boundary 
condition dujdx' = 0 imposing vanishing flux at x' = 0 is automatically respected for inputs at the 
origin due to the symmetry of the setup). 


Introducing a second boundary with a Dirichlet boundary condition (namely ti = 0) at a particular value 
of x' to limit the space to a finite interval is somewhat tedious, but feasible using the method of images. 
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For our purposes, however, it is simpler to stick with a semi-infinite line, and model the essential effect 
of the second boundary by introducing an exponential cutoff term with a long timescale T. In other 
words, we can use the Green’s function for a one-dimensional diffusion equation with an added decay 
term, which reads 


Guix’,t) 


1 _± 
e 4Dt e T . 

y/irDt 


The exponential cutoff ensures that the diffusion process cannot proceed for much longer than a time T, 
which would correspond to the time it takes to reach the hypothetical second boundary. While u{x', t) 
is not strictly zero there (as would be the case for a true boundary with Dirichlet condition), it becomes 
exponentially small for larger x'. 


For a series of inputs (plasticity events) at integer-spaced times t' we can write in analogy to eqn. 


© 


u{x', t) = Aw{t') Gu{x', t — t') , 


where the value of u{x', t) in the vicinity of x' = 0 corresponds to w{t), and the decay of the Green’s 
function close to the origin to the kernel r{t — t'). For {Aw) = 0 the expectation value of this sum 
vanishes, and with {Aw{t) Aw{s)) = 6 t^s the spatiotemporal covariance is equal to 


{u{x[,tl)u{x2,t2)) = ^ Gu{x'i,ti -t')Gu{x2,t2 -t') . 

t'<min(ti,i2) 


We are particularly interested in the correlation at equal times ti = ^ 2 ^ and as above we approximate the 
sum by an integral, which leads to the (stationary) spatial covariance 


PC 

{u{Xi)u{x2)) ~ / 
Jo 


dt 


TrDt 


,2 

a^i -h^2 
4Di 


e T = 


ttD 


Kn 



(13) 


expressed as a modified Bessel function of the second kind. 


Unsurprisingly, this implies that there are long-range correlations, since the Bessel function becomes 
(exponentially) small only for arguments much larger than one, which would require at least one of x\ or 
X 2 to lie beyond the second (softly imposed) boundary. Because the right hand side of eqn. ( [13] ) depends 
only on the combination x'^^ + x' 2 ^, plotting it for the case x'l = x '2 = x' (which gives us the variance) 
is sufficient to also read off the spatial covariance. We use the relation x' = — 1 to rewrite this in 

terms the spatial variable x that is linearly related to the index i of the variables Ui in the discretized 
model (such that each variable is associated with an x interval of equal size; see Section [M4|), and plot 


the variance vs x in Fig. S1 


This plot diverges for x —)■ 0, because we have naively taken the lower limit of integration to be zero in 
eqn. ( [T^ . It is easy to resolve this issue by introducing a finite lower limit of the integral, and in a further 
approximation we can also replace the soft exponential cutoff by a hard upper limit of T/2. This leads 
to the following expression involving incomplete Gamma functions 


{u{x[)u{x'2)) 


pT/2 

' 1/2 


dt 


' 2 , 
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TlDt 
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Figure SI: Plots of the approximate variances (rt^) that follow from eqns. ( [T^ and ( [Til ) x'^ = x '2 = 
g/3^ — 1 vs X, in orange and blue, respectively. Here we set the parameters (which are related to the 
constants a, m and n of the discretized model) to /? = 1, Z) = 1 and T = 10®. 


which is finite at the origin. In fact its value at x'l = x '2 = 0, which corresponds to the variance of the 
synaptic efficacy, is proporfional fo log T. Since in fhe discrefe linear chain model we have T = 
and fhe sfandard deviafion of fhe synaptic weighl disfribufion closely approximafes fhe memory noise 
(mulfiplied by a factor of ^/N), fhis is in agreemenf wifh fhe 1 / scaling behavior of fhe inifial signal 
to noise ratio shown in Fig.|^. 


Afler fhe change of variables from x' fo x we can again plof fhe variance fhaf follows from eqn. 
see Fig. S1 The crucial observafion, bofh from fhis graph and fhe plof of eqn. ( [T3] ), is fhaf fhe variance 
decreases almosf linearly over a wide range of x values (as long as we are nol close fo one of fhe 
boundaries). Coarse graining fhis picfure fo obfain fhe spatially discrefe linear chain model leads to 
variances of fhe variables Ui fhaf are proporfional fo log T and fall off linearly along fhe chain (i.e. wifh 
fhe index i). This is consisfenf wifh fhe simulafions shown in Fig.[^ and explains why fhe long fimescale 
variables require only small dynamical ranges. 


S.3 Parameter variations, potentiation/depression imbalance, and model deformations 


As illustrated in Fig.[^ to achieve the l/y/t decay we have tuned the parameters of the model and as¬ 
sumed that the synaptic modifications are balanced. Given the heterogeneity and variability in biological 
systems, it is important to test whether the model is robust enough to operate also when the parameters 
are not finely funed. We show in Fig. S2 \ fhaf fhe SNR of fhe perfurbed model clearly deviates from 
fhe SNR of fhe unperfurbed model. However, fhe deviafion increases slowly and smoofhly wifh fhe am- 
plifude of fhe perfurbafions. The SNR sfill decays approximafely as 1/ ^/t, buf fhe time af which fhe 
exponenfial breakdown appears becomes progressively shorter as fhe perfurbafions grow. 


If is importanf fo sfress fhaf for long timescales fhere are sfill synapses fhaf refain fhe fracked memory. In 
ofher words, fhe memory signal is sfill significanfly differenl from zero in a subpopulafion of synapses 
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which happen to be well tuned. When reading out all synapses, this signal is too small compared to the 
noise. However, a smart selection mechanism, as the one suggested in |Roxin and Fusi[ (2013), would 
enable the neural circuit to read out the memory even when the SNR of the entire synaptic population is 
too small. 


The memory performance degrades differently when the synaptic modifications are imbalanced (see 
Fig. [S2^). The decay remains almost unaltered, but the SNR curves are shifted downwards. Although 
fine-tuning does not seem to be needed, the memory system is clearly sensitive to imbalances in the 
statistics of potentiation and depression. 

Note that the required balance condition refers to the effective rates of potentiation and depression events 
that actually occur (or more precisely that would occur if there were no limits on the dynamical range), 
not to the relative rates that might be imposed by a certain learning rule depending on the neural activity. 
If for example, given a certain learning rule and statistics of neural activations, potentiation events were 
called for more frequently than depressions, a homeostatic mechanism internal to the synapse could 
compensate for this by scaling down the relative magnitude of potentiation steps (or the probability of 
actually executing it when demanded by the learning rule). As long as the statistics of neural inputs 
do not fluctuate too strongly, such a mechanism can achieve potentiation/depression balance using only 
information locally available to the synapse. 

The essential effect of such a balance condition is to keep the steady state distributions of the synaptic 
variables centered, and prevent it from becoming too concentrated near one of the boundaries of their 
dynamical range, since desirable modifications that would take one of the variables beyond the limits of 
its dynamical range are not possible, and therefore lead to errors that degrade the memory trace. 

While the model is robust to detuning the parameters of its dynamical equations (see Fig. [T] or more 
generally eqn. ([n])), there are certainly ways of deforming it that will drastically reduce the memory 
performance, by effectively introducing new terms in the equations. As in any memory model, there is 
an almost conserved quantity, which in our model is a weighted sum C^Uk analogous to the total 
amount of liquid in all beakers. This quantity would track the sum of all past inputs (i.e. plasticity events 
adding or removing liquid), except for the leakage term that connects the longest timescale beaker(s) 
to the reservoir, which causes it to slowly decay towards zero on a timescale T. Because of this term, 
the total amount of liquid is not precisely conserved (i.e. equal to the sum of past inputs), but only on 
timescales shorter than T (corresponding to a weighted sum of past inputs with an exponential discount 
factor). 

Recall that this leakage takes the form of an unbalanced term on the right hand side of the last one 
of the dynamical equations (which otherwise consist of balanced difference terms, representing nearest 
neighbor interactions). This term is obtained by setting Um-i-i = 0 in the case of the linear chain in 
eqn. Q. It serves an important purpose, which is to keep the variance of the dynamical variables finite, 
i.e. it ensures that the system can settle into a steady state. However, if there were any additional leakage 
terms of this form with a coefficient larger than 0{1/T) in the dynamical equations of any of the Uk 
variables, it would introduce a new decay timescale inversely proportional to that coefficient, and this 
would limit the memory lifetime such a synapse could achieve. 

More generally, whatever the biophysical mechanisms may be by which a synapse stores information 
for a time T, they clearly have to respect the existence of some quantity that is at least approximately 
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Figure S2: Robustness: Effects of parameter variations and potentiation/depression imbalance. A. SNR 
vs number of memories when the parameters of the model are perturbed. The coupling constants be¬ 
tween different dynamical variables (i.e. the cross-section of the tubes connecting different beakers) are 
multiplied by stochastic variables drawn independently from a log-normal distribution. The mean of 
these variables is one and their standard deviation s is indicated in the figure for the different curves. 
Darker curves correspond to larger perturbations. The SNR of the perturbed models deviates from the 
l/\/f behavior. However, the difference in memory performance is still quite modest even for standard 
deviations that are rather large. Notice that for the curve corresponding to the smallest non-zero pertur¬ 
bation, the standard deviation is already as large as 50% of the mean value of the perturbed parameter. 
B. SNR vs number of memories when the rates of potentiation and depression are increasingly imbal¬ 
anced. The power law forgetting curves are very similar, but they are shifted downwards significantly, 
even for fairly small imbalances. The memory performance is rather sensitive to an uncompensated 
potentiation/depression imbalance. 


conserved on timescales of order T. 


S.4 More general readout schemes and synaptic non-linearities 

We have assumed up to now that the synaptic efficacy is simply equal to ui, and therefore, like all the 
variables u^, has a bounded dynamical range. However, we could consider more general expressions for 
the synaptic efficacy, which may in principle depend on all of the internal variables Uk, and need not be 
linear. 

Such a general readout scheme would allow for an even larger class of models. Of course, reading out 
internal variables for A: > 1 would contradict the notion that these variables are hidden, i.e. internal 
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Figure S3: Plot of the SNR in the fully discretized model vs t for two different readouts: ui in black 
and sign(ui) in red. The parameters are the same as for the black curve in Fig. |^, with m = 10, 
N = 5.4 X 10® and 40 levels for each variable. 


to the synapse, and therefore do not communicate with the outside world (the neural network) directl)0 
For this reason, we will not pursue such general output schemes further here. 


We will, however, briefly discuss a more restricted class of readouts that depend only on ui, albeit in 
a non-linear fashion. In this case the synaptic efficacy could take the form of a sigmoidal function of 
ui, possibly restricting the already limited dynamical range of ui even further. In the limit of infinite 
(central) slope this output degenerates to a binary readout that takes values of say ±1, in which case 
we can write the synaptic efficacy as sign(rti). Such a model would thus represent a binary synapse, 
but with complex internal dynamics determining the binary value of its efficacy, and with inputs still 
acting directly on ui. (Other mappings of ui to larger discrete sets are certainly conceivable as readouts, 
leading to synapses with more than two possible values of the efficacy.) 


Interestingly, reading out sign(rii) instead of ui only has a small effect on the signal to noise ratio. 
Its absolute value is slightly lower, reflecting the fact that we are now reading out at most one bit of 
information per synapse (compared to potentially several bits if the synaptic efficacy is equal to ui, 
which may be continuous or discretized into multiple levels depending on which version of the model 
we consider), but the functional form of the signal to noise ratio and the resulting scaling behavior appear 
to be unchanged; see Fig. This is not unlike taking the sign of the weight of a simple unbounded 
synapse after adding the contributions from all memories (i.e. offline learning), which is known to lead 
to an extensive memory capacity ( Sompolinsky[ 19861, except that here the learning happens online and 
using bounded variables. 


°Note in particular that it was precisely the idea of having only a single variable that serves as both the input and the output 
of the synapse that led to models with bidirectional interactions between the variables. Information enters via ui, is distributed 
across longer timescale variables, and those in turn have to back-react on ui, because that is where the information has to 
eventually be read out again. 
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S.5 Autocorrelation function 


Here we compute the autocorrelation function that follows from the simple ansatz in eqn. Q, which 
proposes a synaptic weight that is a linear combination of past plasticity events multiplied by an age- 
dependent decay term 


w{t) = ^ Aw{t') r{t — t') . 

t'<t 

The variance of the weight of such a synapse (again assuming {Aw{t)) = 0) is given by 

r{t - t') r{t - t”) = ^ r2(t - t') , 

t'<t t"<t t'<t 

since the expectation value of a product of two synaptic modifications (of unit magnitude) leads to a 
Kronecker delta Similarly, the covariance of the values of the synaptic weights at two different 

times is given by 

C{t,t + At) = ((w{t)w{t + At)) = {Aw{t') Aw{t")) r{t — t')r{t + At — t") 

t'<t 

= ^ r{t — t')r{t + At — t') . 

Note that the expression that appears inside the sum over t' above depends on t and t' only in the com¬ 
bination t — t', which is the age of the synaptic modification (and thus we can rewrite these formulae 
accordingly, summing over the age of memories). If the synaptic weight distribution reaches a steady 
state, as is the case if the kernel r(t — t') decays sufficiently fast (such that the sums converge), the 
variance will be constant in time, and the covariance will depend only on the delay At. 

In case the kernel function does not quite decay fast enough it has to be regularized appropriately. A 
simple way to achieve this is by stipulating that the synaptic weight depends only on modifications Aw 
of ages up to some (large) time constant T, thus cutting off the sums above after at most T steps. In 
other words, we effectively alter the kernel function such that it vanishes for all but the T most recent 
modifications (but is otherwise unchanged). A similar effect could be achieved in a different manner, 
by multiplying the original kernel function by an exponential with a long timescale T, leading to a soft 
cutoff (as discussed in Suppl. Info. |S.l| ). 

The autocorrelation function A{At) is then given by the ratio C{At)/a‘^. For example, if the kernel 
under consideration was r{t) = t~^, we would find a variance and a covariance 

C{At) = + At)~^ = where represenfs fhe A:fh harmonic number. Since fhese 

infinife sums converge, no regularizafion is needed in fhis case, fhough if could be incorporafed if desired, 
leading fo slighfly more complicafed expressions again involving harmonic numbers. 

For more general kernels, however, possibly wifh culoff, we wonT be able fo perform fhe required sums 
in closed form, and if will be useful fo resorf fo an infegral approximafion. By analyfic confinuafion 
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of the decay function to real arguments, and replacing the sums by integrals, we obtain the following 
approximate autocorrelation function A{At) 

r{t) r{t + At) dt 
dt' 

Here we have incorporated the cutoff T in the upper limit of the integrals (assuming positive At), but it is 
easy to remove this cutoff by extending the range to infinity (T —)■ oo) if this doesn’t cause the integrals 
to diverge. The numerical accuracy of this expression could be improved by a continuity correction in 
the limits of the integrals, but we will not pursue this further her^ 

For the optimal kernel r{t) = with cutoff T the above integral expression leads to 


A{At) = 


fl 


AxiAt) 


2 , f VT + VT-At\ 

log(r) 1 + ) 


for At < T — 1. First taking the limit of large cutoff T and then also taking the delay At large (though 
still much smaller than the cutoff) this reduces to 


Aj'^At) 


T^oo,At^oo^ log(Af/4) / At At \ 

' log(r) riog(r)J 


(15) 


Thus if we plot the autocorrelation function versus log(Af) it will be well approximated by a straight 
line with a slope determined by the cutoff time in this regime. There are, however, significant deviations 
from this behavior for small At and for At close to the cutoff. 

Another kernel that we are interested in is the more general power law r(t) = in particular for 

small e > 0. In this case the integrals converge and we do not have to impose a cutoff (though it could 
be added without much difficulty). We find 

_ e7r3/2 2"At-^csc(7re) 2e (1 + At)(i-^)/2 2 F 1 (l, 1 - e; -^) 

’ r(l-|)r(i^) At{l-e) 

which involves gamma and hypergeomefric funcfions. This expression represenfs fhe aufocorrelafion 
funclion for an arbifrary power-law kernel (decaying fasfer fhan t“^/^). However, we can fake similar 
limifs as above, firsl e —)• 0 and fhen large At, which leads to 

A^{At) 1 — elog(At/4) + O At“^/^, log^(At)^ . 

Again the autocorrelation function approximates a straight line in this regime when plotted against 
log(At), here with slope minus s, and if we identify e with l/log(r) this is in fact the same limit¬ 
ing expression as in eqn. ( [TS] ) for the inverse square root kernel with cutoff T. In other words, both small 

^Typically for monotonically decreasing decay functions most of the error in the approximation is accrued close to the lower 
limit, which one should correspondingly pick somewhat smaller than one to reduce this discrepancy. The expressions for the 
autocorrelation given in this section can be generalized rather easily to arbitrary lower limits of the integrals. 
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positive e and large cutoff T can act as approximately equivalent regularizations of the inverse square 
root kerned 

This also gives us an idea of how closely we could possibly expect an observed synaptic decay function 
to approximate the idealized inverse square root kernel. If the observed decay is described by a power 
law with exponent minus (1 + e)/2, this is sufficient for the synapse to achieve a memory lifetime 
T ~ exp(l/e), i.e. it does not actually have to come extremely close to the case to exhibit a 
substantial memory capacity. 


S.6 Estimating autocorrelation functions from finite length experiments 


In order to measure the autocorrelation function experimentally by observing the response of a synapse 
to a random sequence of plasticity protocols, it is necessary to first calibrate the relative rate of potenti¬ 
ation and depression protocols. Since one does not know the relative rate of (desired) potentiation and 
depression events the synapse in question would experience in its natural environment, and furthermore 
cannot be sure that such natural plasticity events are similar in magnitude to those induced by the chosen 
experimental protocols, using a random sequence of protocols of sufficient strengths to observe an effect 
and with a predetermined fraction of potentiation events would likely drive the synaptic efficacy fo fhe 
edge of ifs dynamical range rafher quickly. The synaptic weigh! may become effectively sfuck in ifs mosf 
depressed sfafe (if fhe pofenfiafion protocol is weaker or less frequenf fhan fhe synapse is accusfomed 
fo) or in ifs sfrongesf sfafe (if fhe pofenfiafion protocol is loo frequenf or effective, and vice versa for fhe 
depression profocol), which would prevenf us from measuring fhe relevanf aufocorrelafion funcfion. 

Thus fhe experimenfer has to ensure fhaf fhe relafive frequency of pofenfiafion and depression prolocols 
is chosen such lhal fhe synapfic efficacy is nol sysfemafically pushed up againsl one of ifs boundary 
values. The offending drifl ferm due fo a pofenfiafion/depression imbalance (relafive fo fhe nalural in- 
puf condifions for fhe synapse in queslion) has fo be kepi small enough fo prevenf fhis from happening 
Ihroughoul fhe measuremenl period. In olher words, if has to be smaller in magnilude fhan fhe dynamical 
range divided by fhe lenglh of fhe experimenl, which in lurn implies fhaf fhe calibration period required 
fo achieve fhis has fo be of a similar durafion as fhe experimenl ifself (assuming fhaf homeosfalic mech¬ 
anisms inside fhe synapse cannof adapf fo changing inpul slalislics on fimescales as shorl as fhe lenglh 
of fhe experimenl). Also, fhe slrenglh of fhe plaslicily protocols has fo be chosen jusl large enough for 
a single one of Ihem fo effecl an observable change in fhe synapfic weigh! (al leasl wilh some finile 
probabilify), bul if al all possible no! so large fhaf fhe magnilude of fhis change would be comparable to 
fhe full dynamical range of fhe synapse. 


In addifion to fhis calibration procedure, one has fo consider carefully how accurafely one can hope to 
esfimale fhe Irue aufocorrelafion funcfion from a finile amounf of dala. The compulalions in Seclion 


S.5 assumed for simplicily lhal fhe mean synapfic weighl vanishes (i.e. (w) = 0, as was our convenlion 


above), bul when measuring synaptic efficacies in physical unils we do nol know fhis mean value a priori. 
Of course if can easily be eslimafed from fhe measured dafa, bul using fhis leads fo a biased eslimalor 


’it should be noted, however, that for finite e and T the autocorrelation function Air{At) looks quite different from 
for sufficiently large At, since it does not have to go to zero at a finite cutoff, and exhibits a long tail with positive curvature. 
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(especially so for short experiments) of the autocorrelation function 


^est (^f) 


{w{t) — w){w{t + At) — w) 
{w{t) — 


( 16 ) 


Here overbars indicate temporal averages that are approximated by sample averages over the available 
data points (of which there are fewer for larger At). 


In Fig.j^we have simulated such an idealized experiment using the simple model of eqn. Q with different 
memory decay functions, and computed the autocorrelation function assuming the true mean of the 
synaptic efficacy is known. If, on the other hand, we had to estimate the mean synaptic weight from 
the data before plotting the autocorrelation function according to eqn. ( [T^ , the result would look rather 
different (and highly distorted), as shown in Fig. [S4|4. The curves for different cutoff times lie almost 
on top of each other, since the finite length of the experiment essentially imposes a lower bound on 
the magnitude of the slope of the (log-linear plot of the) autocorrelation function. In other words, the 
autocorrelation function estimated in this way would not be very flat, even if this was the case for the 
true autocorrelation function (say for a power-law decay with exponent close to minus one half), which 
could give the false impression of a faster memory decay. 


For reference, we show in Fig. S4 3 the same graphs for an an unrealistically long experiment (with 10^ 
data points instead of 10^), in which case the bias becomes small, and both estimators (with known mean 
synaptic weight or without) approach the theoretical autocorrelation functions computed in Section [S75| 


For the simple kernel models described by eqn. Q there are no hard limits of the dynamical range, but 
the variance of the synaptic efficacy is neverfheless bounded for fhe chosen decay funcfions. The lim¬ 
ited dynamical range of real synapses may help alleviate this bias problem, since during the calibration 
procedure the experimenter is likely to encounter the extreme values the synaptic efficacy can take (by 
pushing it against those boundaries) and can use this additional information to establish a better estimate 
of the mean synaptic weight than one could obtain from the measurement phase alone. 


S.7 Comparison with other models and optimal complexity 


Previous synaptic models are characterized by different scaling properties of the initial signal to noise 
ratio and the memory lifetime, the two quantities that we use to summarize the memory performance of 
sy naptic models (se e the Discussion and Table [^: for the best complex models, like the one proposed 


Fusi et al. 


(2005|, both these quantities scale like y/N, where N is the total number of synapses. In 


m 

the model that we propose in this manuscript, the initial SNR still scales essentially like Vn, but the 
memory lifetime is greatly extended, as it scales almost linearly with N. In this section we compare 
systematically the memory performance of different synaptic models. First, it is important to notice 
that all synaptic models that we consider are characterized by a certain number of parameters which 
determine the complexity of the synapse. To perform a fair comparison, it is important to choose the 


optimal parameters for each model. In the case of the cascade model described in Fusi et al. (20051, the 
number of levels fh of the cascade is a measure of the complexity of the synapses. In the case of our 
model, it is the number of variables m. For simplicity we ignore the number of levels in the discrete 
version of our model. The memory performance depends on both the complexity of the synapse and the 
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Figure S4: Log-linear plots of the autocorrelation function estimated from simulated data using eqn. ([^ 
with different decay kernels vs At -|- 1. The five models shown are the same as in Fig.|^ and for com¬ 
parison the dashed lines in both panels reproduce the autocorrelation functions plotted there. A. The 
estimator of eqn. ( [T^ that approximates the mean synaptic weight using the data (solid lines) introduces 
a significant bias, especially for slowly decaying memory traces. The curves for the three models with 
inverse square root decay function (and different cutoff times T = 10^, 10^ and 10®) lie almost on top 
of each other, and exhibit a rather steep slope. This estimator does not correctly capture the broad auto¬ 
correlation functions of models with slow decays and long cutoff times, but would still be sufficient to 
distinguish the proposed model from significantly faster power law-decays. B. Autocorrelation functions 
estimated from simulated experiments with 10^ data points (solid lines), instead of 10^ (dashed lines and 
panel A). In this case the difference between using prior knowledge about the mean efficacy or esfimafing 
if from dafa is negligible (nol shown), buf fhe aufocorrelafion funclions for fhe l/\/i models compufed 
from such an unrealisfically long experimenf are significanlly improved compared fo fhose esfimafed 
from less dafa, and are consisfenf wifh fhe slopes fhaf follow from eqn. (151. 


fofal number of synapses N. For a given number of synapses, fhere is an opfimal complexify. For all of 
fhese models, small memory sysfems perform heller wifh simpler synapses and complexify is required 
only for memory systems wifh large N. For example, in Figure [S5] 4 we show fhe memory lifelime as a 
funclion of fhe complexify m for our model for differenl numbers of synapses N. We use fhe following 
empirical expression fo approximale fhe SNR 


S/J\f{t) ~ 0.8 


N e-*/^ 

t vw ’ 


where T ~ 6 x 4"^ is fhe longesl limescale of fhe model. The dependence on fhe parameters m and 
N was oblained from fhe confinuum model of Sections |M.4| and |S.2| and fhe numerical conslanls were 
delermined by tiffing Ihis expression fo fhe SNR of simulafions of fhe fully discrelized model. We plol 
a differenl curve for each value of N (namely 10®, 10®, 10^ and 10®). All curves grow wifh complexify, 
reach a maximum and Ihen slowly decay. From fhe expression of fhe SNR if is clear why fhe memory 
lifetime is a non-monofonic funclion of m: T appears bofh in fhe exponenfial and in fhe logarilhm 
in fhe denominafor. When fhe complexify increases, fhe range over which fhe 1/y/t decay prevails is 
exfended, as T increases exponentially wifh m. However, fhe initial SNR slowly decreases, as expressed 
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Figure S5: Choosing the optimal complexity and comparing the memory performance of different synap¬ 
tic models. A. Memory lifetime as a function of synaptic complexity m (i.e. the number of variables) in 
the case of the proposed model. Different curves correspond to different numbers of synapses N. For 
each curve there is an optimal complexity. As the number of synapses increases, the peak shifts to the 
right and the optimal complexity increases. B. Comparison between five different synaptic models: our 
proposed model, the cascade model, a simple bistable model, and two heterogeneous population models. 
The five curves are fhe SNRs for fhe differenf models as a funclion of fhe number of memories (or time, 
represenfed on fhe fop axis, as in fhe SNR plofs in fhe main fexf). 


by fhe logarifhm in fhe denominafor. If fhe SNR is well described by a power law for mosf of fhe fime 
fhroughouf which SNR > 1, fhe exponenfial in fhe SNR expression can be expanded, and fhe memory 
lifetime t* can be approximafed by 


t* 




0.64 T 

1.28+ (r log r)/iv ■ 


This clearly shows fhaf t* increases linearly wifh T, and hence exponentially wifh m, as long as T log T 
is negligible compared fo N. 


To compare differenf models, we need fo choose fhe opfimal complexify for each model. When juxfa- 
posing our model and fhe cascade, we fixed fhe number of synapses N = 10®, and used fhe opfimal 
paramefers m = 14 and ttt, = 16 fhaf maximize fhe memory lifetime. The resulting SNRs are plotted 
in Figure [S5^. For reference, we also show fhe SNRs of fhe simple synapfic model of Amif and Fusi| 
( |1994| l, in which we choose fhe learning rafe fhaf maximizes fhe memory lifetime, and of fwo hefero- 
geneous models in which memories are stored in mulfiple populafions of simple bisfable synapses fhaf 
are characferized by differenf learning rafes (see Roxin and Fusi[ 2013[ and Section [S^ . The sizes of 
fhe synapfic populafions have been chosen fo produce fwo differenf power-law decays of fhe SNR: for 
one curve fhe decay is 1 /t, as in fhe cascade model, while for fhe ofher if is 1/ y/i, as in fhe proposed 
model. Even fhough fhe heferogeneous models can also harness processes fhaf operafe on a wide range of 
fimescales, fhese processes are nof inferacfing (i.e. fhe variables are independenf). Our proposed model 
has an inifial SNR fhaf is only slighfly larger fhan fhaf of fhe cascade model, and orders of magnifude 
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larger than in the simple synaptic model. However, as already discussed in the main text, the memory 
lifetime is several orders of magnitude larger than in any previous model of bounded synapses. This 
improvement would be strongly reduced for smaller memory systems (i.e. for lower N). This means that 
for smaller brains it may be wasteful or even counterproductive to have synapses that are too complex. 
To take advantage of complexity, it is important to have a sufficient number of memory resources (in our 
case the number of synapses). 



Time Dependence of S 

Initial S/Af 

Memory Lifetime 

Unbounded 

consl. 

< o{^/n) 

0(N) 

Bisfable (fasf) 

~ 

o{Vn) 

0{\og(N)) 

Bisfable (slow) 

~ 

0 (1) 

o(Vn) 

Helerogeneous 

~ 1 /t 


0 (lofAl) 

Cascade 

~ 1/f 

Olio'S) 

Olio'S) 

Mullislage 

~ 1/t 

^ (\/ logtv) 

^ (\/ logTv) 

Proposed Model 

~ 1/\/t 

^ (f/logtv) 

^ (logiv) 


Table 1: Approximate scaling properties of different synaptic models. “Unbounded” refers to models in 
which the synaptic variables can vary in an unlimited range, as in the Hopfield model ( Hopfield] 1982 1 , 
or more generally to models in which the dynamical range of each synapse is at least of order V-ZV- 
In the case of the Hopfield model, fhere is no sfeady sfafe, so fhe initial signal fo noise rafio is large 
(as given in fhe fable) really only for fhe firsl few memories. As more memories are sfored, fhe noise 
increases, and fhe signal fo noise rafio decreases as 1 /i/f, where t is fhe fofal number of sfored memories. 
Bisfable synapses have fwo sfable synapfic values and fhe fransifions befween fhem are sfochaslic ( |Amif| 
and Fusi jl994 1. Fasf synapses exhibif a large learning rale L (i.e. a Iransilion probabilily of 0(1)), 


whereas slow synapses are characferized by fhe slowesf possible learning rale (i.e. fhe smallesl Iransilion 
probabilily lhal keeps fhe initial signal fo noise ratio above Ihreshold, which is L = 0{1/\/N)). In 
fhe helerogeneous model (|Roxin and Fusi[ |2013|) fhe synapses have differenl learning rales (see also 


Secfion[S^. The cascade model is described in Fusi el al. (20051 and fhe mullislage model in Roxin and 


|Fusi| ( |2013] ). In fhe Iasi row we report fhe scaling properties of fhe model lhal we propose in Ibis article, 
which are superior fo olher bounded models, and differ only by logarifhmic factors from fhe unbounded 
case. Alfhough fhe approximafe scaling of fhe helerogeneous model is fhe same as for fhe cascade, fhe 
lalfer performs significanlly better (see Fusi ef ahj 2005 [ Fig. 8). If is imporlanl to remember lhal fwo 
models wilh fhe same scaling behavior may nol work equally well, as fhe coefficienls in fronl of fhe 
faclors reported in fhe fable mighl be quife differenl. However, if is unlikely lhal a model wilh a better 
scaling behavior would perform worse, as N is assumed to be very large. 
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S.8 Models with independent variables 


To illustrate why interactions between dynamical variables are beneficial for memory performance, let 
us briefly consider what would happen if the variables were completely decoupled from each other. We 
could assign different timescales to different (populations of) variables, which we can think of as inde¬ 
pendent, simple synapses. Incoming memories could then be stored in a distributed fashion in multiple 
synapses with a wide range of decay rates, such that at any point in time at least some of them retain 
a reasonably large memory signal (up to some maximal memory lifetime). A simple version of such 
heterogeneous models has been described in||Roxin and Fusi|(|2013|l. 


The synaptic variables could be inherently bounded, such as e.g. binary variables, or we could assume a 
priori unbounded synapses. Either way, such simple synaptic models characterized by a single timescale 
r correspond (at least in a mean field sense) to an exponential memory decay function r{t) ~ exp(—t/r), 
which is essentially flat up to a memory age of order r. In the latter case, even though the synaptic 
weights are a priori unbounded, we would still like to achieve a steady state distribution of synaptic 
weights that has a small variance, i.e. requires us to distinguish only a small number of different values 
at any point during online learning. This is not possible for simple unbounded synapses, however, since 
the standard deviation of the weight distribution grows as ^/T. In this context, interactions are useful for 
achieving long timescales while maintaining small dynamical ranges for all variable^ 

For binary variables with switching probability of order l/r, on the other hand, the expected mem¬ 
ory signal will be Sr{t) ~ (1/^) exp(—f/r). In this case the variance is naturally bounded, with the 
corresponding memory noise being approximately constant in time and independent of r. Thus when 
considering the signal to noise ratio for an agnostic readout that takes into account all synapses equally, 
we can simply average the signals, while dividing by the noise merely contributes a constant factor (such 
that S/Af = 0{y/N) for a homogeneous population of size N and fixed r). 


If we have a distribution of timescales p{t) within a population of such synapses, the total memory 
signal we expect can be computed by integrating over this distribution (adding the contributions from all 
timescales present). For example, for a power-law distribution p{t) ~ with p > 0 (which would 
have to be cut off at a smallest and/or largest timescale, and normalized appropriately) we would find a 
total memory signal 


Sp{t) = I dr p{t) Srit) ^ dr^^=t ^ t ^T{p) , (17) 

which (approximately) exhibits a power-law decay at times t smaller than some large cutoff T (but larger 
than some shortest timescale, which we could have incorporated above by writing a finite lower cutoff 
for the integral). 

A commonly used distribution of timescales is /?(r) ~ l/r, since this corresponds to a uniform distribu¬ 
tion on a logarithmic scale (dr/r = dlogr). In this case, the incomplete Gamma function in eqn. ( [T7| ) 

*Of course one can always trivially reduce this dynamical range by simply reducing the step size for each plasticity event, 
but again this would not improve the situation, because what really matters is the number of distinct values that the synaptic 
efficacy is likely to take during learning, i.e. the number of levels we would have to introduce when discretizing this variable in 
order to well approximate the dynamics in the unbounded case. 
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simplifies such that Sp{t) ~ (1/t) exp(— t/T), i.e. the signal exhibits a 1/t decay with a soft cutoff at 
the longest timescale T. If, on the other hand, we wanted to achieve the slower I/Vi memory decay with 
independent binary variables, we would need a distribution p{t) ~ I/v/t" that puts more emphasis on 
longer timescales. With this distribution eqn. leads to Sp(t) ~ erfc(, i.e. the desired 

power law, again with an exponential cutoff at times of order T. However, skewing the distribution in 
favor a longer timescales leads to a rather inefficient system (in terms of the number of variables needed 
to achieve a given initial signal to noise ratio) as we shall see below. 

When building concrete models, we usually do not consider continuous distributions of timescales, but 
instead a finite number of populations with appropriately chosen timescales, i.e. we approximate the 
desired decay function by a superposition of a finite number of exponentials. Again, it is natural to 
choose these timescales equally spaced on a logarithmic scale. We can parameterize them e.g. as 
for t = 1, 2,... m and for some real-valued n > 1, which allows us to efficiently cover a large range of 
timescales, in this case from 1 to T = using only a small number of timescales m = C>(logT). 

If these populations are of equal sizes, as in the most common heterogeneous binary model, the memory 
signal will be given by 


S{t) ~ exp(-t , (18) 

i=l 


which for 1 < t < T and n not much larger than one well approximates the 1 /t power-law memory decay 
(with the quality of the approximation deteriorating as n grows, since successive timescales become more 
widely spaced). Similar mechanisms also generate the effective 1/t memory decay of the cascade ( Fusi[ 
|et al.[ 20051 and multistage memory consolidation models ( Roxin and Fusi[[2013| , even though these 
models do contain interactions between different timescales. 


However, if we allow the populations of increasing timescales to become progressively larger, we can 
slow down the memory decay. In particular, if the population size grows as the memory signal 

m 

S{t) ~ exp(-t n-2(*-i)) , (19) 

i=l 


will approximate a 1/Vi decay under the same conditions as in the previous paragraph. The crucial 
difference, however, lies in the total number of variables N needed to achieve a given initial signal to 
noise ratio, which is dominated by the variables in the first population (with t = 1; the other populations 
do contribute, but the sums in eqns. (181 and ( [T^ converge to numbers of 0(1) even at t = 0). For 
each such variable we only needed 0(log T) others to achieve the 1/t decay, and consequently the initial 
signal to noise ratio in that case was Sq/Mq = 0{VN /logT). For the 1/Vi decay however, each 
variable in the fastest population corresponds to 0{VT) others with slower timescales, and as a result 
the initial signal to noise ratio drops to Sq/Mq = 0{V^VJT). This implies that we would need N ^ T 
variables to achieve a reasonable initial signal to noise ratio, and in particular that this construction cannot 
extend the memory lifetime beyond 0{VN)- 


More generally, even though independent variables can be used to shape almost arbitrary monotonically 
decreasing memory functions, slowing down the decay requires investing a larger number of them in 
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longer timescale populations, which detracts from the initial signal to noise ratio, and ultimately is not 
helpful in extending the overall memory lifetime. As we have shown, however, introducing properly 
tuned interactions between variables of different timescales can overcome these limitations. 


S.9 Relation to Markov chains and number of states 


Given the quantization of the dynamical variables described in Section M.5[ our model of a complex 
synapse could be rewritten as (and is mathematically equivalent to) a simple Markov chain, with the 
input term biasing the transition matrix depending on the memory to be stored at a given time step. 
A state of this Markov chain would be a joint assignment of all the quantized variables, and thus if 
every variable had L levels the total number of states would be L™, which can be a huge number. In 
particular, we have seen that it is sufficient to have L = 0{y/\ogT^), and given that m ~ 2c log A” for 
some constant c the number of states would be of order (log i.e. superlinear 

in N. Due to this large number of states, our model is consistent with the general bounds on memory 
performance of Markov chains derived in Lahiri and Ganguli (2013[). 


The corresponding transition matrix would be an enormous by L™ matrix, but writing out the dy¬ 
namics in this language would completely hide the underlying structure of the interactions. Even though 
the two points of view are ultimately equivalent, the system is much simpler to describe in terms of inter¬ 
acting variables, rather than listing all their possible states and transitions between them, and we believe 
the former description is also more relevant in terms of informing possible biophysical implementations. 


S.IO Chemical reaction interpretation 

The basic linear chain model we devised is essentially a particular spatial discretization of a one-dimensional 
diffusion process, and diffusion equations are ubiquitous in nature. For this reason there are many con¬ 
ceivable ways of implementing the required dynamics in terms of physical variables. The Ui variables 
may in reality be an effective, coarse-grained description emerging from complex microscopic dynam¬ 
ics, but here we would like to give one simple example of a possible implementation of our model in 
which the Ui can be identified direcfly wifh microscopic variables, namely (appropriafely normalized 
changes in) concenfrafions of chemicals inside a synapse. This naive implemenfafion is nof mean! fo 
capfure fhe complexify of fhe biochemical processes fhaf are involved in synapfic memory consolidation. 

If is merely a simple example mean! fo demonsfrafe one possible mechanism fhaf could insfanfiafe fhe 
proposed dynamics. 

This inferprefafion of fhe model will lake fhe form of a chain of equilibrium chemical reactions such as 

Ai+ Xi ^ A2 -h T2 ; ^3 + -^3 ^ A4 -h T4 ; A5 -h A5 ^ Aq + Yq 

A2 + X2 ^ As + Y3 ; A4 + X4 ^ As + Y5 


One can imagine fhaf fhe Aj are cerfain species of biomolecules conlained inside a synapse fhaf parlic- 
ipafe in equilibrium (hence bidirecfional) reacfions wifh ofher molecules of species Xi and Yi. These 
ofher molecules may be small, ubiquitous, and do nof necessarily have fo be confined inside fhe synapse. 
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Each of the Xi and Yi symbols may in fact represent a set of several (possibly distinct) molecules, or 
even the empty set (e.g. if the Xi are groups that can bind to a large biomolecule Ai to form Aj+i there 
may be no other reaction products Fj+i at all). 

For each reaction there will be a chemical equilibrium condition, such as e.g. A'lX^ / (A 2 Y 2 ) = const, for 
the first reaction, where (in an abuse of notation) we have used the same symbols that label a chemical 
species to also denote their concentration and it is understood that if any of the Xi or Yi include multiple 
molecules (possibly of different species) we have to multiply their concentrations accordingly. Here and 
in what follows, stars indicate equilibrium (or more generally steady state) quantities. 

If we were dealing with an elementary reaction whose dynamics can be described by collision theory, 
we could write more explicit equations such as r ^2 = '’'12 ~ fcr 2 ^ 2^2 for the rates rf 2 

of the forward and backwards reactions, where kf 2 and k ^2 are constants. In equilibrium, both of these 
would be equal to the steady state reaction rate ^12 — ^l~2^1"^l = fe]^ 2 ^ 2 f^ 2 *’ reproducing the above 
equilibrium condition. 

Since we have assumed that the Xi and Yi are so common that their concentrations are effectively un¬ 
changed by small perturbations of the equilibrium, introducing e.g. a small excess amount AAi changes 
the forward rate of the first reaction by Ar^ ~ kf 2 A^iXi = rjj '2 AA\/A\ to first order. This will then 
lead to the production of more A 2 , which in turn will increase the rates of the reactions in which it par¬ 
ticipates by Ar ^2 ^rid Ar (depending in a similar fashion on AA 2 ), and in this manner the perturbation 
denoted by AAi will spread along the chain of reactions. 

In fact, under the assumption of simple rate equations (collision theory), small perturbations of this 
system of equilibrium chemical reactions lead to differential equations equivalent to those describing 
random walks (or spatially discretized heat diffusion), such as 

= ^^^2 - ^^ r 2 - ^^^3 + 

, (AA^ AA 2 \ , (AA 2 AA^\ 

A* Al ) ^23 J , 

and similarly for the other AAi. Comparing this to eqn. we see that our dynamical variables Ui 
can be identified wifh renormalized deviafions from fhe equilibrium concenfrafions: Ui ~ AAi/A*. 
The crucial quanfify fhaf needs fo be fracked fo determine fhe synaptic efficacy would fhus be AAi/A\. 
Furfhermore, from Fig.[T]or eqn. ( [TT] ) we can read off fhe relevanf parameters: Sfeady sfafe concenfrafions 
A* acfs as heaf capacities (or cross-sectional areas of beakers) and equilibrium reaction rates r*i_^i play 
fhe role of thermal conductivities (or tube sizes). The appropriate tuning of these parameter would thus 
suggest exponentially decreasing reaction rates along the chain, and exponentially increasing equilibrium 
concentrations. 

We can easily generalize this to a chain of equilibrium chemical reactions with arbitrary stoichiometric 
coefficients 


-h Ai ^ P2^2 + ^2 ; A3A3 -h A3 

X 2 A 2 + X 2 ^ P3^3 + ^3 ; 


/J 4^4 + ^4 ; A5A5 -|- A5 ^ P6^6 + Ae 
A4A4 -|- A4 ^ P5^5 + ^5 ■ • • ) 
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which, again assuming simple rate equations and expanding them to first order in small perturbations, 
merely changes the identification of the dynamical variables to 


Ui ~ 


n}-2 Pi AA, 
A- 


It is clear that one can further extend this type of model implementation to complex networks of equi¬ 
librium reactions analogous to the general graph models discussed in Section M.6| If we are willing to 
forgo the one to one identification of our dynamical variables Ui with individual microscopic quantities 
(such as concentrations) we can even imagine networks of non-equilibrium reactions that implement the 
required dynamics with the m interpreted as effective variables (akin to diffusion on directed graphs, 
which makes sense as long as every node can be reached starting from every other node). 


While assuming collision theory likely is overly simplistic, the above results really only rely on the fact 
that a small perturbation to chemical equilibrium leads to changes in reaction rates that are proportional 
to the excess concentrations AAi to a first approximation, which is not implausible. 

Also, while declaring the concentrations of Xi and Yi constant is convenient and leads to very simple 
equations, we could consider the case in which they do change significantly, and therefore influence fhe 
reactions rales and fhe resulting dynamics of fhe AAj. In fhaf case, however, if would be imporfanf fhaf 
fhese Xi and Yi are confined and Iheir numbers preserved wilhin fhe synapse, since e.g. an influx of such 
molecules from external sources could change fhe dynamics in imporfanf ways. This is in conlrasl lo 
fhe case we have considered, where only fhe Ai molecules have lo be confined and preserved inside fhe 
synapse, up lo perhaps a slow decay or leakage of fhese molecules over time intervals of fhe order of 
fhe longesl implemenfed timescale. Since molecular lumover is nof arbifrarily slow, however, fhe simple 
inlerprelafion presenled in Ibis seclion does nof seem suilable for very long-ferm memory, which likely 
requires mulli-slable mechanisms. 


S.ll Real time versus event-triggered dynamics 

In fhe discrete time simulations of Figs. and we have for simplicity assumed one new memory to 
be stored for each time step, but in reality there is no reason to think that new memories necessarily 
arrive at a constant rate. Generalizing to arbitrary, randomized timings of (desired) plasticity events does 
not pose a fundamental problem from a modeling point of view, either in a continuous time framework 
or in discrete time simulations with only some time steps coinciding with non-zero inputs. However, 
the question arises whether the internal dynamics of the model variables should be running even when 
there are no inputs, or whether their interactions should be triggered only by (desired) potentiation and 
depression events, with the dynamical variables otherwise frozen. Both scenarios are conceivable. 

We can consider describing the input statistics by a distribution of inter-plasticity event intervals. As 
long as this distribution is sufficiently concentrated around its peak (i.e. has no long tails), such that the 
variance of the synaptic efficacy remains small enough fo avoid biffing fhe boundaries of ifs dynamical 
range too frequenlly, we wouldn’f expecf fhe model behavior lo be subslanlially differenl from fhe case 
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discussed abov43 

If the internal dynamics is event-triggered with a certain mean rate of inputs, the model dynamics should 
he equivalent to the case of constant (and equal) rate of arrival of memories, on timescales much longer 
than the inverse of that rate. On the other hand, if the internal dynamics is always running in physical time 
(regardless of inputs), we would have to adjust the inverse timescale a in eqns. or (|^ appropriately 
to achieve the same effective behavior on long timescales. 

There are significant differences between these models, however, when the distribution of inter-memory 
input intervals is broad, e.g. when bursts of many experiences to be stored are followed by long periods 
of silence (e.g. sensory deprivation). An event-triggered mechanism could handle such variability in 
time rather easily (since physical time plays no role in it, as nothing happens when there is no input, 
and thus only the number of memories to be stored matters). On the other hand, an always-on internal 
dynamics would run into trouble given a small dynamical range, as mentioned above (or equivalently, 
would require a larger dynamical range to achieve the same memory performance). 

Thus, input-triggered dynamics appears to have an advantage in terms of flexibility (it can easily adapt to 
changing input conditions), but physical time dynamics appears to be much simpler to implement (e.g. as 
in Section [S.10[ where one possible way of identifying the Ui with physical variables is suggested), with¬ 
out the additional layer of complication required to build appropriate triggering mechanisms. One could 
even imagine hybrid schemes in which only shorter timescale variables have event-triggered dynamics. 

A number of secondary, and more subtle issues arise when we consider the case of discretized variables 
Ui with stochastic transition dynamics. One would like to keep the additional variance (noise) due to the 
stochasticity of the internal dynamics small, which may again favor event-triggered dynamics. Taking 
this even further, and noting that the transition probabilities for longer timescale variables become pro¬ 
gressively smaller (i.e. transitions become rarer), one could then devise a scheme in which these variables 
are not updated every time a new input arrives, but even more infrequently, say on timescales on which 
their cumulative transition probability is of order one half. 


S.12 Memory retrieval 

In order to study memory retrieval we have to consider a particular neural circuit and specify its ar¬ 
chitecture. Here we chose a feedforward, perceptron-like architecture, and a fully connected recurrent 
neural network. In both cases the stored memories were random and uncorrelated. More specifically, 
in the feedforward case we simulated one neuron receiving N inputs. The memories were stored by 
imposing a random input pattern on the inputs and its associated desired output on the readout neuron. 
The neural activities of both the input and the output were ±1, chosen randomly with equal probability. 
Each memory was stored by modifying the synapses according to a simple covariance rule, similar to the 

^More generally, if we allow plasticity events of varying sizes, we could consider the joint distributions of inter-plasticity 
event intervals and event sizes, or perhaps even several such joint distributions that might be different for potentiation and 
depression, and possibly depending on the nature of the previous plasticity event. Again it would be important for the synaptic 
weights to not spend too much time at the edge of their dynamical range, which they might hit now either due to bursts of 
plasticity events of the same sign at an unusually high rate, or due to plasticity events of exceptionally large magnitude. 
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Figure S6: A. Probability of correct retrieval as a function of the memory age for a feedforward network. 
The baseline corresponding to chance level (pcorrect = 1/2) has been subtracted. Different colors corre¬ 
spond to different numbers N of synapses, which for this architecture equals the number of inputs. The 
dashed line corresponding to Pcorrect = 0.9 is the arbitrary threshold for retrievability (i.e. a memory of a 
certain age is called retrievable if the feedforward network generates the correct output with probability 
larger than 0.9). B. Linear scaling of the number of retrievable memories as a function of the number of 
synapses N. The number of retrievable memories is determined by finding the intersection of the dashed 
line with the Pcorrect curves of panel A. The black circles are the results of simulations and the line is a 
linear fit (the slope is 0.027). We used m = 8 dynamic variables discretized with 40 levels each. The 
parameters have not been optimized to maximize the memory capacity (e.g. the optimal m would change 
depending on N). 


prescription used in the Hopfield network (Hopfield]|l982 1. For the memory stored at time t' 




where is the activity imposed on the presynaptic neuron j and xi't') is the desired output. This 
Awj{t') determines how our complex synapses are updated (when it is positive, the synapse is potenti¬ 
ated, when it is negative, the synapse is depressed). Each memory was stored only once. 

After all memories had been stored, we tested whether they could be retrieved by choosing one specific 
pattern ^j(f') and using it as the input. The activity of the output neuron was determined by computing 
the sign of the weighted sum of the inputs, i.e. sign( Wj (f')). A memory was counted as correctly 
retrieved when this output matched the desired output x(f) that was stored during memorization. 


Scaling properties (feedforward network) The ideal observer approach predicts that the number of 
storable memories should scale linearly with the number N of synapses. In the neural circuit that we 
considered, the number of synapses is equal to the number of inputs. To test the prediction of the ideal 
observer approach, we progressively increased N and determined the maximum number of memories 
that could be retrieved correctly. To estimate the number of retrievable memories, we ran multiple 
simulations to determine the probability Pcorrect that a memory is retrievable when followed by some 
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number of subsequently stored memories (this number is basically the age of the memory that we are 
tracking, see Fig. S6\). If ^correct = 1 we have perfect retrieval, whereas Pcorrect = 1/2 corresponds to 
chance level, as the probability of guessing the correct output is 1/2. The number of retrievable memories 
is estimated by determining the largest age for which Pcorrect > 0.9. This number increases linearly with 
N (see Fig.jS^), as predicted by the ideal observer approach. 


The memory signal of the ideal observer approach is (the normalized expectation value of) the overlap 
x{t) Cjii) if we assume that the distribution of this quantity is Gaussian, the probability 

^correct of it being positive (i.e. retrieval being correct) is 


_ 1 

^correct — TT 



where S/M is the ideal observer signal to noise ratio. Thresholding Pcorrect is thus equivalent to thresh¬ 
olding S/M, and we have shown that the capacity defined by the latter prescription grows (almost) 
linearly with N. 


Generalization and signal to noise ratio In the simple neural circuit that we considered it is easy to 
estimate the generalization ability of the network, which is related to the strength of the ideal observer 
memory signal. Indeed, consider the feedforward network we used to validate the scaling properties. 
To assess the ability to generalize one can degrade the quality of the input cues, and determine the 
maximum degradation that is tolerated by the neural circuit, i.e. that still produces the correct response. 
We decided to degrade the inputs by flipping the sign of their components with probability e. This form 
of degradation reduces the ideal observer signal by a factor 1 — 2 e, leading to: 


= 5 - 5 erf 


1 - 2e 


s/m'^ . 


( 20 ) 


If we demand that pcorrect = 0.9 as above, the argument of the error function must be equal to some 
constant (that has to be determined by inverting the error function). The minimal tolerated similarity 
1 — 2 e between the stored input and the cue use for retrieval will therefore be inversely proportional to 


the SNR. This prediction is verified by the simulations in Fig. S7 


Recurrent network and attractor dynamics We also simulated a Hopfield network ( Hopfield] 1982) 
with complex synapses to study recall in the recurrent case. Here we apply the usual Hebbian learning 
rule Awij{t') = to a fully connected, symmetric network of W + 1 binary neurons (the total 

number of synapse^being N{N + l)/2). Retrieval is now a multistep procedure that iteratively and 
asynchronously updates randomly chosen neurons i by setting their activity to sign( ?j)> where 

is the current state of the neural ensemble that was obtained starting from some cue related to one of the 
stored patterns The cue may again be corrupted by randomly flipping a fraction e of its components. 
Recently stored memories corresponds attractor states under this recall dynamics (see Fig.|S8|). 


'“As noted in the discussion, even though the total number of synapses is of order 0{M), the number of synapses relevant 
for the ideal observer analysis is N, since this is the maximum number that can receive independent inputs. Alternatively, we 
can think of computing the signal to noise ratio for the local circuit consisting of afferent neurons connected to a particular one, 
in which case the relevant number of synaptic weights is again N. 
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Figure S7: A. Generalization ability of a feedforward network storing random uncorrelated memories. A. 
Relation between the generalization ability and the ideal observer signal to noise ratio. The generalization 
ability is expressed on the vertical axis as the minimal overlap between the stored memory and the cue 
used for memory retrieval that leads to successful retrieval (1—2 e, as discussed in the text). The black line 
is the prediction of the theory and the red dots are the results of simulations. Both the overlap and the SNR 
are on logarithmic scales and the line on the plot expresses the predicted power law (1 — 2 e oc 1/SNR). 
B. Generalization ability as a function of the memory age, or equivalently, the number of memories that 
are stored after the tracked memory. The generalization ability is expressed as the maximum degradation 
e that can be tolerated, i.e. that still leads to successful retrieval. Different curves correspond to different 
numbers of synapses N (which in this case of a feedforward network is also the number of inputs). 
The strong initial SNR allows for very large tolerated degradation (e = 0.5 corresponds to a completely 
uncorrelated memory). As the SNR decays the maximum tolerated degradation also decreases, and 
becomes zero when the memories can no longer be retrieved correctly even when the cues used for 
memory retrieval are not degraded. The circles are the results of the simulations and the lines are the 
theoretical prediction that follows from eqn. ( 201 , i.e maximum e = 1/2 — where 7 was fitted 

to the curves (7 = 3.05), and t is the memory age. The agreement between theory and simulation results 


is remarkably good. All other parameters are the same as in Fig. S 6 


S.13 Sparse Representations 


Representing memories by sparse patterns of neural activations has long been known to carry certain 
computational benefits (Tsodyks and Feigel’man 1988 Amit and Fusi[ 19941. In particular, it allows us 
to store a larger number of patterns. It is important to appreciate however, that storing a larger number of 
items by decreasing the coding level does not necessarily imply an increased total information content, 
since a sparse pattern contains less information about the memory it represents (compared to the maximal 
amount of information that can be stored in a dense pattern of the same size). 


In the main text, we have considered binary valued patterns of synaptic modifications with both values 
(±1) equally likely and independently distributed, which implies a dense representation. In this case the 
information density of the stored patterns is maximal {N bits for a binary pattern of size N). Even though 
we have limited the discussion to the case of dense coding above, we can easily combine the advantages 
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Figure S8: Overlap between the retrieved state and the stored memory patterns versus age of the memory 
in a simulation of a fully connected Hopfield network of binary neurons. The recall procedure is deter¬ 
ministic and asynchronous. For every memory we cycle through (and update the activity of) all neurons 
ten times in a random order, after which the resulting state typically is either very close to the stored 
memory (an overlap of one indicates perfect retrieval) or very far from it (an overlap of zero is chance 
level). The blue line corresponds to using an uncorrupted cue for retrieval (e = 0), while the red line 
corresponds to e = 1/4. In both cases recent memories are recalled flawlessly, while for very old ones 
retrieval doesn’t converge to an appropriate attractor state (even though the quality of one-step feedfor¬ 
ward retrieval may still be high), with large fluctuations between those two extremes in an intermediate 
age range. The number of neurons was 30000, and for each synapse we used m = 4 variables discretized 
with 30 levels each. 

of sparse coding with the beneficial properties of our complex synaptic model if we are interested in 
increasing the number of stored patterns (at the expense of the amount of information stored per pattern). 


Neural architectures and retrieving information In order to discuss sparse coding we will have to 
choose a particular architecture. Here we consider the usual perceptron-style setup with N presynaptic 
neurons (indexed by i) afferent to a single (postsynaptic) neuron that performs a binary classification of 
patterns. In order to perform such a classification correctly (i.e. to reconstruct one binary feature of the 
input pattern), the readout will have to retrieve one bit per pattern from the set of N afferent synapses, 
where this information has to be storecf^ 

We can generalize this architecture further by considering a number of parallel, independent readouts, 
receiving inputs from the same set of N presynaptic neurons, but through separate sets of N synapses 
for each of them. While a single binary readout can reconstruct at most one bit about each of the stored 
patterns of presynaptic activity, some number R of parallel readouts can reconstruct up to R bits of 
information about any one of the stored patterns using a total of N x R synapses. In particular, if we 

"This is most easily seen for (hetero)associative memory, in particular if the correct outputs are a priori independent of the 
inputs. In this case the recall cue alone (without the learned weights) can provide no information about the correct outputs. 
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consider N parallel readouts, the architecture becomes a feedforward mapping between two layers of N 
neurons each (and using synapses in total 

There are now two distinct notions of capacity we can consider, the number of patterns that can be 
recalled at a given time (with high fidelity), and the total amount of information about them that can be 
read out from the synaptic weights. For independent memories this total information capacity is equal 
to the sum over all stored patterns of the retrievable information about each of them. Note that the 
information that is actually retrievable by a particular readout may be smaller than the total amount of 
information stored in the synaptic weights. 

In the case of dense coding there are N bits to be reconstructed for every pattern, for which we require 
N independent readouts, and if all of them can be retrieved successfully the total information capacity 
will be N times the number of patterns that can be recalled. Since the number of patterns scales almost 
linearly with N this leads to a total of almost 0{N‘^) bits stored in synapses (plus possibly some 
residual information about older patterns that cannot be recalled with high probability anymore). Because 
each of our synapses has only a small number L of distinguishable states of its efficacy (fhe sfafes of fhe 
infernal variables nol being accessible fo fhe readouf), we cannof hope fo increase fhe fofal informafion 
capacify much furfher, since in fhis case log 2 L bifs per synapse is a sfricf upper limif for fhe informafion 
fhaf can be read ouf. In our model L is relafively small and grows only very slowly wifh fhe number of 


fhere are only fwo sfafes and fhe upper limif is one bif per synapse. In bofh of fhese cases, fhe informafion 
capacify is close fo ifs upper bound. 


synapses (as -^log N). For fhe case of a complex synapse wifh binary weighfs discussed in Secfion S.4 


We can, however, increase fhe number of pafferns sfored by reducing fhe coding level, while keeping 
fhe fofal information confenf approximately consfanf. If we assume for simplicify fhaf fhere is a sharp 
fransifion befween recenf memories fhaf can be reconsfrucfed perfecfly, and older ones of which hardly 
any informafion can be refrieved, fhe fwo nofions of capacify will be proporfional for any fixed coding 
level /, wifh fhe consfanf of proporfionalify being fhe informafion confained in (or equivalenfly fhe 
Shannon enfropy of) a single paffern, which is —A^/ log 2 (/) — A^(l — /) log 2 (l — /). When / is small, 
fhe informafion per pattern scales approximafely as N f and is reduced by a faclor / wifh respecf fo fhe 
dense case. As we will see, sparseness may allow us fo store a number of patterns fhaf is a faclor 1// 
larger, leading fo an informafion capacify fhaf is approximafely fhe same as in fhe dense case and again 
close fo ifs upper bound. 


Note fhaf for simple synapses such as fhe bisfable synapses sfudied in Amil and Fusi (19941, fhe infor¬ 
mafion capacify can be very differenf for fhe dense and fhe sparse case. Indeed, for bisfable synapses 
fhe maximum number of refrievable dense patterns scales like Vn, nol (almosl) N as in fhe proposed 
model, so fhe information capacify is far from ifs upper bound. In conlrasl, for sparse represenfalions, fhe 
number of refrievable patterns scales like N‘^, salurafing fhe informafion capacity. Neverlheless, synapfic 
complexity remains imporlanf for al leasl fwo reasons: fhe firsl one is fhaf fhe initial signal fo noise rafio 
is much larger for complex fhan for simple models, and fhe second one is fhaf fhe N‘^ scaling requires 
a very small / which would nol be compatible wifh fhe observed / (see below for a more exlensive 
discussion). 


*^Of course, if the task is to reconstruct the original input patterns one to one (as in autoassociative memory), some part of 
the correct pattern will have to be provided in the form of a corrupted cue for every recall trial, and in this case the cue does 
contain relevant information. 
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Signal to noise ratio for sparse coding We can perform a signal to noise ratio analysis similar to that 
of Section M.2| for the case of sparse representations. To this end, we have to generalize the definition 
of the memory signal of eqn. Q slightly, in a way that makes explicit the binary decision the readout 
neuron faces when retrieving a previously stored pattern. We can write the signal as 




N 


2N\ 


( 21 ) 


i=l 


where the binary variable x denotes the response of the postsynaptic neuron (imposed at the time the 
pattern in question was stored) and v{t') is a readout vector appropriate for retrieving the memory stored 


at time t'. This can be viewed as an extension of the signal to noise ratio defined in Amif and Fusi 
(1994 1 ; Rubin and Fusi ( 2007| , where populations of neurons were sfudied and v was jusf fhe paffern of 
pre-synapfic acfivify imposed on fhe nefwork fo frigger memory refrieval. 

For fhe noise (squared) we have 

^ Var + ^Var (w^i(f)lx(t')=o ■ 


These definifions allow us fo consider fhe signal fo noise rafio beyond fhe ideal observer approach (which 
would imply v{t') = Aw{t')) and beyond fhe case of densely coded oufpufs (where x = 0 and x = 1 
are a priori equally likely) fhaf we have discussed in fhe main fexf. Here and in whaf follows, we wrife 
equafions for a single posfsynapfic neuron (dropping fhe posfsynapfic index), since fhe generalizafion fo 
several independenf oufpufs is frivial. 


Sparseness with covariance learning rule In order fo sfudy fhe effecfs of sparse coding on fhe signal 
fo noise rafio we also need fo specify a learning rule (fhaf furns neural acfivify pafferns info pafferns 
of desirable synapfic modificafions Aw). A simple learning rule we will consider is fhe Hebbian-fype 
covariance rule ( [Tsodyks and Feigel’man[ 1988; Sfanfon and Sejnowski[ 19891 discussed below eqn. Q, 
which in fhe case of a single posfsynapfic neuron reduces fo 


Awi{t) oc (^i(f) - a)(x(f) - b) 


where insfead of a single coding level / we disfinguish befween fhe pre- and posfsynapfic coding levels 
a = (^i) and b = (x), assuming for simplicify fhaf fhe binary variables and x take values 0 or 1. We 
will fake fhe readouf vecfor fo be of fhe form Vi{t) = ^i(f) — c for some consfanf c, i.e. if depends solely 
on fhe appropriafe paffern of presynapfic acfivifies. 


Assuming spafially and femporally uncorrelafed inpuf patterns ^{t) and associafed oufpuf labels x(^)> 
expecfafion values can easily be computed simply by mulfiplying fhe probabilifies of fhe independenf 
pre- and posfsynapfic acfivafions (i.e. a or 1 — a for fhe pre- and 6 or 1 — 6 for fhe posfsynapfic side) and 
summing over all combination^^ 


*^Note that we are not averaging over the labels x for ths memory being recalled since the output is explicitly fixed to the 
two possible values in the two terms of eqn. 1 21 1 . 
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This leads to the signal to noise ratio (considering the memory stored at t' = 0 without loss of general- 

iiyfl 


S/Af{t) = C{a, 6 , c) 



Nr‘^{t) 


where C 


2 - 1^6 (1 — b){a — 2ac + c^) 


( 22 ) 


Apart from the coefficient C depending on the parameters a, b and c, this is the same signal to noise 
ratio as in eqn. Q. In particular, the dependence on N and the kernel r{t) is the same, and thus all the 
same considerations as above apply. The optimal time dependence is still r{t) ~ and therefore 

the memory lifetime will scale as the square of the initial signal to noise ratio, or equivalently the square 
of the overall coefficient of the signal to noise ratio, which (for fixed N) can be made larger by making 
neural represenfafions sparser. 


The opfimal value of c fhaf maximizes fhis coefficienl is given by c = a, in which case we find C = 
1/(2y^6(1 — b)). In ofher words, fhe besf readouf is one in which fhe readouf vector v is proporfional 
fo Aw. This case is closely relafed fo fhe ideal observer approach discussed in fhe main fexf, and in 
facl if in addifion fhe posfsynapfic represenfafions are dense, i.e. b = 1 / 2 , fhe more general definilion of 
fhe signal fo noise rafio in eqn. ( |2T] ) reduces to fhe previously discussed case of eqn. Q, and we obfain 
exacfly fhe same resulf as above. 


On fhe ofher hand, if we consider a readouf proporfional fo fhe presynapfic acfivify (i.e. c = 0, which 
mighf be considered more realisfic fhan fhe above if we view fhe readouf neuron as a linear classifier 
wifh fixed fhreshold computing fhe overlap w.^) fhe numerical coefficienl in fronl of fhe signal fo noise 
ratio simplifies to C = \/l — a/{2y/b (1 — b)). Unlike case of fhe optimal readouf (wifh c = a), Ihere 
remains a dependence on fhe presynapfic coding level. We will disfinguish fwo scenarios: one in which 
fhe presynapfic coding level remains consfanf as b changes, and one in which a = b. 


In all of fhese cases fhe coefficienl in fronl of fhe signal noise rafio grows asymplofically as 1 /Vb as 
fhe oulpul coding level is reduced, and Iherefore (since fhe opfimal lime dependence is still Xjyft) fhe 
memory lifefime (fhe number of palfems fhaf can be recalled al a given lime) grows asymplofically as 


oc 1/6, as shown in Fig. S9 


Sparseness with Amit-Fusi learning rule We can perform a similar analysis for fhe perhaps more 
biologically plausible Amil-Fusi learning rule ( Amif and Fusi[ [1994 1, which pofenfiales a synapse by 
a cerlain amounf if bolh fhe pre- and posfsynapfic neurons are aclive, and depresses if by a fraclion 
//(I — /) of fhaf amounf when only fhe presynapfic neuron is aclivaled, where / is fhe coding levej^ 
When fhe presynapfic neuron is quiescenl, no plaslicify evenls occur. Note fhaf fhis presynaplically 
gated learning rule is balanced, in fhe sense fhaf fhe expecfalion value of fhe weigh! change vanishes. We 
again use a readouf vecfor fhaf is linearly relafed to fhe presynapfic acfivify, i.e. Vi{t) = ^i(f) — c, wifh 
0 < c < 1 . 


*''The normalizations of Aw and v don’t matter here, since they cancel in the signal to noise ratio. Also, the two terms in the 
expression for the (squared) noise turn out to be equal to each other. 


15 


Here we take the pre- and postsynaptic coding levels to be equal (both /) for simplicity, as in 


Amit and Fusi 


( 19941 . In 


that paper a whole family of learning rules was studied for simple binary synapses. Here we consider just one particularly 
straightforward rule, according to which potentiation events occur with probability and depression events with probability 
/(I — /), and apply it to our model of a complex synapse. 
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Figure S9: Doubly logarithmic plots of the robust recall capacity versus postsynaptic coding level b for 
the case of the covariance learning rule. Data points are simulations of the fully discretized version of our 
synaptic model with m = 5 variables and N = 2000 synapses. We call a pattern robustly recalled if the 
probability of a readout neuron classifying it correctly is larger than 0.99 (we estimate this probability 
by sampling 500 independent readouts). This very stringent criterion (which leads to numerically rather 
small capacities) is necessary to study the scaling behavior, since the chance level for recall grows as 
we decrease b, and the threshold must be chosen high enough such that even for the sparsest patterns 
considered it is still well above chance level. For each data point we numerically optimize the bias term 
fixing the location of the decision boundary, which unlike the case of dense coding is no longer zero. We 
normalize the learning rule such that {Awi) = 1 for all i, while the spacing between adjacent levels of 
the discretized variables remains unity. The three curves (from top to bottom) correspond to the optimal 
readout (c = a, shown here with in blue with constant a = 1 / 2 , though the numerical results for a = 6 
are not significantly different), the simple readout (c = 0 ) with a = b (red), and the simple readout 
with a = 1/2 (green). The dashed lines indicate the corresponding predictions of eqn. (22), where one 
parameter common to the three curves (the overall magnitude) has been adjusted to match the data. This 
leads to a good fit, even though the calculation leading to eqn. ( 221 is agnostic about the implementation 
of the internal synaptic dynamics and the discretization of variables. 


In this case the signal to noise ratio of eqn. (| 2 T]) is 


S/Mit) = C 



Nr‘^{t) 


with C 


1 — c 

V(i - /)M(c - /)2 + /) + Nf{i -/)(c - fy ’ 


which again differs from the result for dense coding only by a coefficient (here called C). The compu¬ 
tation of the noise term leading to this result is slightly more involved than for the covariance learning 
rule, since even though (Amj(f')) = 0 for the stored patterns (averaging over the labels x)> they are 
not spatially uncorrelated, i.e. for a given pattern {Awi{t') Awj{t')) doesn’t vanish for i 7 ^ j. This is 
easy to see, since according to the learning rule all nonzero components of patterns for which the output 
neuron was active are positive (whereas they are all negative if the output neuron was quiescent). These 
correlations introduce a dependence on N in the coefficient C, which multiplies the signal to noise ratio 
and therefore rescales the number of patterns that can be recalled (oc C^) 
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In the case of the simple readout (thresholding X]j=i i-®- ^ = 0), this coefficient reduces to 

C = 1/ y^4/ (1 — /) (1 + /2 [N — 1)). Even though the number of patterns that can be retrieved grows 
as the coding level decreases, correlations severely limit the capacity unless N is small (see Fig. SlOl. 


A much better readout (almost, but for finite N not precisely optimal) is again obtained by setting c = f. 
In this case we have C = 1/^/4/, i-e. we expect the capacity to grow in inverse proportion to /. While 
asymptotically (as / —;■ 0 for fixed N) equivalenf fo fhe simple c = 0 case, this readout leads to a much 
larger capacity when Np is not small, because it circumvents the problem of spatial correlations of 


individual patterns of synaptic modifications (see Fig. SlOl. Note that this is in essence a prescription 
for learning a separate threshold (or bias term) for every readout neuron, in a manner that is linear in the 
sum of the incoming weights, while before we had assumed that all readouts have the same bias (though 
we numerically optimized this common value for each combination of / and N). 


The results of this section show that we can easily combine the computational benefits of complex 
synapses, which provide us with a close to optimal information capacity even for dense representa¬ 
tions, and those of sparse coding, which allow us to further increase the number of patterns stored (at the 
cost of reducing the amount of information per pattern, such that the total information capacity remains 
roughly constant). 


We have seen that signal to noise ratio calculation for the abstract model of a synapse with decay function 
r{t) and sparse coding factorizes into a kernel-dependent part (which is identical to the case of dense 
coding and leads to all the same considerations about the optimal kernel ~ and its implementation 
using internal synaptic dynamics), and a part depending on the coding levels of neural representations, 
readout schemes, and the learning rule, which can be optimized separately. 

The above computation for the deterministic Amit-Fusi rule can be generalized to the stochastic case 
(e.g. with a synaptic depression step that has the same magnitude as the potentiation step, but occurs 
only with probability //(I — /) if the pre- but not the postsynaptic neuron is active, i.e. the conditions 
for synaptic depression are otherwise met). More generally, the concrete implementation of our fully 
discretized (Markov chain) model is of course always stochastic, even if the learning rule determining 
Am is not. At this more detailed level of description (which we will not investigate further here), the 
parameters of the learning rule and neural representations can no longer be viewed as entirely indepen¬ 
dent of those of the synaptic model. Instead, they should of course be chosen to complement each other, 
e.g. given a certain number and spacing of discrete levels for the synaptic efficacy, the learning rule 
should be normalized to optimally use the available dynamical range. 
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Figure SIO: Doubly logarithmic plots of the robust recall capacity versus coding level / for the case of 
the Amit-Fusi learning rule. The parameters and recall procedure are otherwise as for Fig. except 
that we vary the number of presynaptic neurons, from N = 500 (yellow) to = 4000 (blue). A: Simple 
readout with presynaptic activity (c = 0). Spatial correlations of the patterns of synaptic modifications 
severely limit the capacity for moderate coding levels in this case, and only for very small / (when 
N <C 1) does the growth of the capacity approach 1//. Solid lines are numerical simulations, while 
dashed lines indicate the predicted scaling with where one parameter (the overall magnitude) has 
been fit for this family of curves. B: For the balanced readout (c = /) the capacity is much larger and 
grows essentially linearly with = 1/(4/) (dashed lines, again a one parameter fit). 
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